Strings in Swift 4


(Ben Cohen) #1

Hi all,

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:

We’re eager to hear everyone’s thoughts.

Regards,
Ben and Dave

# String Processing For Swift 4

* Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](https://github.com/airspeedswift)

The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus
far, with just this short blurb in the
[list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):

**String re-evaluation**: String is one of the most important fundamental
types in the language. The standard library leads have numerous ideas of how
to improve the programming model for it, without jeopardizing the goals of
providing a unicode-correct-by-default model. Our goal is to be better at
string processing than Perl!
   
For Swift 4 and beyond we want to improve three dimensions of text processing:

  1. Ergonomics
  2. Correctness
  3. Performance

This document is meant to both provide a sense of the long-term vision
(including undecided issues and possible approaches), and to define the scope of
work that could be done in the Swift 4 timeframe.

## General Principles

### Ergonomics

It's worth noting that ergonomics and correctness are mutually-reinforcing. An
API that is easy to use—but incorrectly—cannot be considered an ergonomic
success. Conversely, an API that's simply hard to use is also hard to use
correctly. Achieving optimal performance without compromising ergonomics or
correctness is a greater challenge.

Consistency with the Swift language and idioms is also important for
ergonomics. There are several places both in the standard library and in the
foundation additions to `String` where patterns and practices found elsewhere
could be applied to improve usability and familiarity.

### API Surface Area

Primary data types such as `String` should have APIs that are easily understood
given a signature and a one-line summary. Today, `String` fails that test. As
you can see, the Standard Library and Foundation both contribute significantly to
its overall complexity.

**Method Arity** | **Standard Library** | **Foundation**
---|:---:|:---:
0: `ƒ()` | 5 | 7
1: `ƒ(:)` | 19 | 48
2: `ƒ(::)` | 13 | 19
3: `ƒ(:::)` | 5 | 11
4: `ƒ(::::)` | 1 | 7
5: `ƒ(:::::)` | - | 2
6: `ƒ(::::::)` | - | 1

**API Kind** | **Standard Library** | **Foundation**
---|:---:|:---:
`init` | 41 | 18
`func` | 42 | 55
`subscript` | 9 | 0
`var` | 26 | 14

**Total: 205 APIs**

By contrast, `Int` has 80 APIs, none with more than two parameters.[0] String
processing is complex enough; users shouldn't have to wade through this much API
sprawl just to get started.

Many of the choices detailed below contribute to solving this problem,
including:

  * Restoring `Collection` conformance and dropping the `.characters` view.
  * Providing a more general, composable slicing syntax.
  * Altering `Comparable` so that parameterized
    (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
  * Clearly separating language-dependent operations on text produced
    by and for humans from language-independent
    operations on text produced by and for machine processing.
  * Relocating APIs that fall outside the domain of basic string processing and
    discouraging the proliferation of ad-hoc extensions.

### Batteries Included

While `String` is available to all programs out-of-the-box, crucial APIs for
basic string processing tasks are still inaccessible until `Foundation` is
imported. While it makes sense that `Foundation` is needed for domain-specific
jobs such as
[linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
one should not need to import anything to, for example, do case-insensitive
comparison.

### Unicode Compliance and Platform Support

The Unicode standard provides a crucial objective reference point for what
constitutes correct behavior in an extremely complex domain, so
Unicode-correctness is, and will remain, a fundamental design principle behind
Swift's `String`. That said, the Unicode standard is an evolving document, so
this objective reference-point is not fixed.[1] While
many of the most important operations—e.g. string hashing, equality, and
non-localized comparison—will be stable, the semantics
of others, such as grapheme breaking and localized comparison and case
conversion, are expected to change as platforms are updated, so programs should
be written so their correctness does not depend on precise stability of these
semantics across OS versions or platforms. Although it may be possible to
imagine static and/or dynamic analysis tools that will help users find such
errors, the only sure way to deal with this fact of life is to educate users.

## Design Points

### Internationalization

There is strong evidence that developers cannot determine how to use
internationalization APIs correctly. Although documentation could and should be
improved, the sheer size, complexity, and diversity of these APIs is a major
contributor to the problem, causing novices to tune out, and more experienced
programmers to make avoidable mistakes.

The first step in improving this situation is to regularize all localized
operations as invocations of normal string operations with extra
parameters. Among other things, this means:

1. Doing away with `localizedXXX` methods
2. Providing a terse way to name the current locale as a parameter
3. Automatically adjusting defaults for options such
   as case sensitivity based on whether the operation is localized.
4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
    guidance in the
    [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).

Along with appropriate documentation updates, these changes will make localized
operations more teachable, comprehensible, and approachable, thereby lowering a
barrier that currently leads some developers to ignore localization issues
altogether.

#### The Default Behavior of `String`

Although this isn't well known, the most accessible forms of many operations on
Swift `String` (and `NSString`) are really only appropriate for text that is
intended to be processed for, and consumed by, machines. The semantics of the
operations with the simplest spellings are always non-localized and
language-agnostic.

Two major factors play into this design choice:

1. Machine processing of text is important, so we should have first-class,
   accessible functions appropriate to that use case.
   
2. The most general localized operations require a locale parameter not required
   by their un-localized counterparts. This naturally skews complexity towards
   localized operations.

Reaffirming that `String`'s simplest APIs have
language-independent/machine-processed semantics has the benefit of clarifying
the proper default behavior of operations such as comparison, and allows us to
make [significant optimizations](#collation-semantics) that were previously
thought to conflict with Unicode.
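Today's APIs already reflect this split. A small illustration, using Foundation's localized comparison as the human-facing counterpart (the localized result depends on the active locale):

```swift
import Foundation

// Default String comparison is non-localized and language-agnostic:
// "B" (U+0042) precedes "a" (U+0061) in Unicode scalar order.
print("B" < "a")  // true

// Human-facing ordering requires an explicitly localized operation;
// its result depends on the current locale.
let humanOrder = "B".localizedStandardCompare("a")
print(humanOrder == .orderedDescending)  // "a" sorts first in most locales
```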

#### Future Directions

One of the most common internationalization errors is the unintentional
presentation to users of text that has not been localized, but regularizing APIs
and improving documentation can go only so far in preventing this error.
Combined with the fact that `String` operations are non-localized by default,
the environment for processing human-readable text may still be somewhat
error-prone in Swift 4.

For an audience of mostly non-experts, it is especially important that naïve
code is very likely to be correct if it compiles, and that more sophisticated
issues can be revealed progressively. For this reason, we intend to
specifically and separately target localization and internationalization
problems in the Swift 5 timeframe.

### Operations With Options

There are three categories of string operation that commonly need to be
tuned in various dimensions:

**Operation**|**Applicable Options**
---|---
sort ordering | locale, case/diacritic/width-insensitivity
case conversion | locale
pattern matching | locale, case/diacritic/width-insensitivity

The defaults for case-, diacritic-, and width-insensitivity are different for
localized operations than for non-localized operations, so for example a
localized sort should be case-insensitive by default, and a non-localized sort
should be case-sensitive by default. We propose a standard “language” of
defaulted parameters to be used for these purposes, with usage roughly like this:

```swift
x.compared(to: y, case: .sensitive, in: swissGerman)

x.lowercased(in: .currentLocale)

x.allMatches(
  somePattern, case: .insensitive, diacritic: .insensitive)
```

This usage might be supported by code like this:

```swift
enum StringSensitivity {
  case sensitive
  case insensitive
}

extension Locale {
  static var currentLocale: Locale { ... }
}

extension Unicode {
  // An example of the option language in declaration context,
  // with nil defaults indicating unspecified, so defaults can be
  // driven by the presence/absence of a specific Locale
  func frobnicated(
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> Self { ... }
}
```

### Comparing and Hashing Strings

#### Collation Semantics

What Unicode says about collation—which is used in `<`, `==`, and hashing—turns
out to be quite interesting, once you pick it apart. The full Unicode Collation
Algorithm (UCA) works like this:

1. Fully normalize both strings
2. Convert each string to a sequence of numeric triples to form a collation key
3. “Flatten” the key by concatenating the sequence of first elements to the
   sequence of second elements to the sequence of third elements
4. Lexicographically compare the flattened keys

While step 1 can usually
be [done quickly](http://unicode.org/reports/tr15/#Description_Norm) and
incrementally, step 2 uses a collation table that maps matching *sequences* of
unicode scalars in the normalized string to *sequences* of triples, which get
accumulated into a collation key. Predictably, this is where the real costs
lie.
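To make steps 3 and 4 concrete, here is a toy sketch of key flattening and comparison; `CollationTriple` and its weights are invented for illustration, whereas real weights come from Unicode's collation tables:

```swift
// A toy model of UCA steps 3–4. The triples and weights here are
// invented for illustration; real ones come from the collation tables.
struct CollationTriple {
    let primary: UInt16, secondary: UInt16, tertiary: UInt16
}

// Step 3: concatenate all primary weights, then all secondaries, then
// all tertiaries, with a 0 separator between levels so that a prefix
// sorts before its extensions.
func flattenedKey(_ triples: [CollationTriple]) -> [UInt16] {
    var key = triples.map { $0.primary }
    key.append(0)
    key += triples.map { $0.secondary }
    key.append(0)
    key += triples.map { $0.tertiary }
    return key
}

// Step 4: lexicographic comparison of the flattened keys, so that any
// primary-level difference dominates all secondary-level differences.
func collatesBefore(_ a: [CollationTriple], _ b: [CollationTriple]) -> Bool {
    return flattenedKey(a).lexicographicallyPrecedes(flattenedKey(b))
}
```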

*However*, there are some bright spots to this story. First, as it turns out,
string sorting (localized or not) should be done down to what's called
the
[“identical” level](http://unicode.org/reports/tr10/#Multi_Level_Comparison),
which adds a step 3a: append the string's normalized form to the flattened
collation key. At first blush this just adds work, but consider what it does
for equality: two strings that normalize the same, naturally, will collate the
same. But also, *strings that normalize differently will always collate
differently*. In other words, for equality, it is sufficient to compare the
strings' normalized forms and see if they are the same. We can therefore
entirely skip the expensive part of collation for equality comparison.

Next, naturally, anything that applies to equality also applies to hashing: it
is sufficient to hash the string's normalized form, bypassing collation keys.
This should provide significant speedups over the current implementation.
Perhaps more importantly, since comparison down to the “identical” level applies
even to localized strings, it means that hashing and equality can be implemented
exactly the same way for localized and non-localized text, and hash tables with
localized keys will remain valid across current-locale changes.
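Swift's existing equality semantics already illustrate the normalization point: canonically equivalent strings compare equal and hash identically, regardless of how their scalars are composed.

```swift
// Canonical equivalence in action: precomposed and decomposed forms
// normalize identically, so they compare equal and hash the same.
let precomposed = "caf\u{00E9}"   // "café" with é as a single scalar
let decomposed = "cafe\u{0301}"   // "café" as e + combining acute accent

print(precomposed == decomposed)         // true
print(precomposed.unicodeScalars.count)  // 4
print(decomposed.unicodeScalars.count)   // 5

// Hashing follows equality, so either form works as a dictionary key.
let menu = [precomposed: 3.50]
print(menu[decomposed]!)                 // 3.5
```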

Finally, once it is agreed that the *default* role for `String` is to handle
machine-generated and machine-readable text, the default ordering of `String`s
need no longer use the UCA at all. It is sufficient to order them in any way
that's consistent with equality, so `String` ordering can simply be a
lexicographical comparison of normalized forms[4]
(which is equivalent to lexicographically comparing the sequences of grapheme
clusters), again bypassing step 2 and offering another speedup.

This leaves us executing the full UCA *only* for localized sorting, and ICU's
implementation has apparently been very well optimized.

Following this scheme everywhere would also allow us to make sorting behavior
consistent across platforms. Currently, we sort `String` according to the UCA,
except that—*only on Apple platforms*—pairs of ASCII characters are ordered by
unicode scalar value.

#### Syntax

Because the current `Comparable` protocol expresses all comparisons with binary
operators, string comparisons—which may require
additional [options](#operations-with-options)—do not fit smoothly into the
existing syntax. At the same time, we'd like to solve other problems with
comparison, as outlined
in
[this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
(implemented by changes at the head
of
[this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
We should adopt a modification of that proposal that uses a method rather than
an operator `<=>`:

```swift
enum SortOrder { case before, same, after }

protocol Comparable : Equatable {
  func compared(to: Self) -> SortOrder
  ...
}
```

This change will give us a syntactic platform on which to implement methods with
additional, defaulted arguments, thereby unifying and regularizing comparison
across the library.

```swift
extension String {
  func compared(to: Self) -> SortOrder
}
```

**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible
that the standard library simply adopts Foundation's `ComparisonResult` as is,
but we believe the community should at least consider alternate naming before
that happens. There will be an opportunity to discuss the choices in detail
when the modified
[Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes
up for review.

### `String` should be a `Collection` of `Character`s Again

In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
convinced ourselves that its semantics differed from those of `Collection` too
significantly.

It was always well understood that if strings were treated as sequences of
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`,
and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was
a collection of `Character` (extended grapheme clusters). During 2.0
development, though, we realized that correct string concatenation could
occasionally merge distinct grapheme clusters at the start and end of combined
strings.

This quirk aside, every aspect of strings-as-collections-of-graphemes appears to
comport perfectly with Unicode. We think the concatenation problem is tolerable,
because the cases where it occurs all represent partially-formed constructs. The
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE
ACCENT)—are explicitly called out in the Unicode standard as
“[degenerate](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)” or
“[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The other
cases—such as a string ending in a zero-width joiner or half of a regional
indicator—appear to be equally transient and unlikely outside of a text editor.

Admitting these cases encourages exploration of grapheme composition and is
consistent with what appears to be an overall Unicode philosophy that “no
special provisions are made to get marginally better behavior for… cases that
never occur in practice.”[2] Furthermore, it seems
unlikely to disturb the semantics of any plausible algorithms. We can handle
these cases by documenting them, explicitly stating that the elements of a
`String` are an emergent property based on Unicode rules.
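Concretely, under the restored conformance (the model Swift 4 ultimately shipped), element-wise operations line up with grapheme clusters:

```swift
// Element-wise operations respect grapheme cluster boundaries.
let cafe = "cafe\u{0301}"            // "café" with a decomposed é

print(cafe.count)                    // 4 Characters, not 5 scalars
print(cafe.unicodeScalars.count)     // 5

// reversed() keeps the combining accent moored to its base character.
print(String(cafe.reversed()))       // "éfac"
```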

The benefits of restoring `Collection` conformance are substantial:

  * Collection-like operations encourage experimentation with strings to
    investigate and understand their behavior. This is useful for teaching new
    programmers, but also good for experienced programmers who want to
    understand more about strings/unicode.
    
  * Extended grapheme clusters form a natural element boundary for Unicode
    strings. For example, searching and matching operations will always produce
    results that line up on grapheme cluster boundaries.
    
  * Character-by-character processing is a legitimate thing to do in many real
    use-cases, including parsing, pattern matching, and language-specific
    transformations such as transliteration.
    
  * `Collection` conformance makes a wide variety of powerful operations
    available that are appropriate to `String`'s default role as the vehicle for
    machine processed text.
    
    The methods `String` would inherit from `Collection`, where they parallel
    higher-level string algorithms, have the right semantics. For example,
    grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of
    `flatMap` with case-conversion, produce the same results one would expect
    from whole-string ordering comparison, equality comparison, and
    case-conversion, respectively. `reverse` operates correctly on graphemes,
    keeping diacritics moored to their base characters and leaving emoji intact.
    Other methods such as `indexOf` and `contains` make obvious sense. A few
    `Collection` methods, like `min` and `max`, may not be particularly useful
    on `String`, but we don't consider that to be a problem worth solving, in
    the same way that we wouldn't try to suppress `min` and `max` on a
    `Set([UInt8])` that was used to store IP addresses.
    
  * Many of the higher-level operations that we want to provide for `String`s,
    such as parsing and pattern matching, should apply to any `Collection`, and
    many of the benefits we want for `Collections`, such
    as unified slicing, should accrue
    equally to `String`. Making `String` part of the same protocol hierarchy
    allows us to write these operations once and not worry about keeping the
    benefits in sync.
    
  * Slicing strings into substrings is a crucial part of the vocabulary of
    string processing, and all other sliceable things are `Collection`s.
    Because of its collection-like behavior, users naturally think of `String`
    in collection terms, but run into frustrating limitations where it fails to
    conform and are left to wonder where all the differences lie. Many simply
    “correct” this limitation by declaring a trivial conformance:
    
    ```swift
    extension String : BidirectionalCollection {}
    ```
    
    Even if we removed indexing-by-element from `String`, users could still do
    this:
    
    ```swift
    extension String : BidirectionalCollection {
      subscript(i: Index) -> Character { return characters[i] }
    }
    ```
    
    It would be much better to legitimize the conformance to `Collection` and
    simply document the oddity of any concatenation corner-cases, than to deny
    users the benefits on the grounds that a few cases are confusing.

Note that the fact that `String` is a collection of graphemes does *not* mean
that string operations will necessarily have to do grapheme boundary
recognition. See the Unicode protocol section for details.

### `Character` and `CharacterSet`

`Character`, which represents a
Unicode
[extended grapheme cluster](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries),
is a bit of a black box, requiring conversion to `String` in order to
do any introspection, including interoperation with ASCII. To fix this, we should:

- Add a `unicodeScalars` view much like `String`'s, so that the sub-structure
   of grapheme clusters is discoverable.
- Add a failable `init` from sequences of scalars (returning nil for sequences
   that contain 0 or 2+ graphemes).
- (Lower priority) Expose some operations, such as `func uppercase() ->
   String` and `var isASCII: Bool`, and, to the extent they can be sensibly
   generalized, queries of Unicode properties that should also be exposed on
   `UnicodeScalar`, such as `isAlphabetic` and `isGraphemeBase`.
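A sketch of the proposed failable initializer; the name, constraints, and implementation strategy here are illustrative, not final API:

```swift
// Hypothetical sketch of a failable Character init from scalars.
extension Character {
    init?<S: Sequence>(scalars: S) where S.Element == Unicode.Scalar {
        var s = ""
        for scalar in scalars { s.unicodeScalars.append(scalar) }
        // Succeed only if the scalars form exactly one grapheme cluster.
        guard s.count == 1, let c = s.first else { return nil }
        self = c
    }
}
```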

Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar`
type. This means it is usable on `String`, but only by going through the unicode
scalar view. To deal with this clash in the short term, `CharacterSet` should be
renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to
introduce a `CharacterSet` that provides similar functionality for extended
grapheme clusters.[5]
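The scalar-oriented nature of today's `CharacterSet` is visible in how it is used with `String`:

```swift
import Foundation

// Despite its name, CharacterSet tests individual Unicode scalars,
// so String interoperation goes through the unicodeScalars view.
let vowels = CharacterSet(charactersIn: "aeiou")
let word = "unicode"
let vowelCount = word.unicodeScalars.filter { vowels.contains($0) }.count
print(vowelCount)  // 4: u, i, o, e
```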

### Unification of Slicing Operations

Creating substrings is a basic part of String processing, but the slicing
operations that we have in Swift are inconsistent in both their spelling and
their naming:

  * Slices with two explicit endpoints are done with subscript, and support
    in-place mutation:
    
    ```swift
        s[i..<j].mutate()
    ```

  * Slicing from an index to the end, or from the start to an index, is done
    with a method and does not support in-place mutation:
    ```swift
        s.prefix(upTo: i).readOnly()
    ```

Prefix and suffix operations should be migrated to subscripting operations
with one-sided ranges, i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as
in
[this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
With generic subscripting in the language, that will allow us to collapse a wide
variety of methods and subscript overloads into a single implementation, and
give users an easy-to-use and composable way to describe subranges.
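With one-sided ranges in place (using today's spellings, e.g. `firstIndex(of:)`), the prefix and suffix methods collapse into subscripts:

```swift
let s = "Hello, Swift"
let comma = s.firstIndex(of: ",")!

// One-sided ranges replace the prefix/suffix method family:
let prefix = s[..<comma]                        // was s.prefix(upTo: comma)
let suffix = s[s.index(comma, offsetBy: 2)...]  // was s.suffix(from:)

print(prefix)  // "Hello"
print(suffix)  // "Swift"
```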

Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 5)`
is an ongoing research project that can be considered part of the potential
long-term vision of text (and collection) processing.

### Substrings

When implementing substring slicing, languages are faced with three options:

1. Make the substrings the same type as string, and share storage.
2. Make the substrings the same type as string, and copy storage when making the substring.
3. Make substrings a different type, with a storage copy on conversion to string.

We think number 3 is the best choice. A walk-through of the tradeoffs follows.

#### Same type, shared storage

In Swift 3.0, slicing a `String` produces a new `String` that is a view into a
subrange of the original `String`'s storage. This is why `String` is 3 words in
size (the start, length and buffer owner), unlike the similar `Array` type
which is only one.

This is a simple model with big efficiency gains when chopping up strings into
multiple smaller strings. But it does mean that a stored substring keeps the
entire original string buffer alive even after it would normally have been
released.

This arrangement has proven to be problematic in other programming languages,
because applications sometimes extract small strings from large ones and keep
those small strings long-term. That is considered a memory leak and was enough
of a problem in Java that they changed from substrings sharing storage to
making a copy in 1.7.

#### Same type, copied storage

Copying of substrings is also the choice made in C#, and in the default
`NSString` implementation. This approach avoids the memory leak issue, but has
obvious performance overhead in performing the copies.

This in turn encourages trafficking in string/range pairs instead of in
substrings, for performance reasons, leading to API challenges. For example:

```swift
foo.compare(bar, range: start..<end)
```

Here, it is not clear whether `range` applies to `foo` or `bar`. This
relationship is better expressed in Swift as a slicing operation:

```swift
foo[start..<end].compare(bar)
```

Not only does this clarify to which string the range applies, it also brings
this sub-range capability to any API that operates on `String` "for free". So
these other combinations also work equally well:

```swift
// apply range on argument rather than target
foo.compare(bar[start..<end])
// apply range on both
foo[start..<end].compare(bar[start1..<end1])
// compare two strings ignoring first character
foo.dropFirst().compare(bar.dropFirst())
```

In all three cases, an explicit range argument need not appear on the `compare`
method itself. The implementation of `compare` does not need to know anything
about ranges. Methods need only take range arguments when ranges are an
integral part of their purpose (for example, setting the start and end of a
user's current selection in a text box).

#### Different type, shared storage

The desire to share underlying storage while preventing accidental memory leaks
occurs with slices of `Array`. For this reason we have an `ArraySlice` type.
The inconvenience of a separate type is mitigated by most operations used on
`Array` from the standard library being generic over `Sequence` or `Collection`.

We should apply the same approach to `String` by introducing a distinct
`SubSequence` type, `Substring`. The advice given for `ArraySlice` would apply
similarly to `Substring`:

> **Important:** Long-term storage of `Substring` instances is discouraged. A
> substring holds a reference to the entire storage of a larger string, not
> just to the portion it presents, even after the original string's lifetime
> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
> of large strings that are no longer otherwise accessible, which can appear
> to be memory leakage.

When assigning a `Substring` to a longer-lived variable (usually a stored
property) explicitly of type `String`, a type conversion will be performed, and
at this point the substring buffer is copied and the original string's storage
can be released.
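A small example of the conversion point:

```swift
let bigString = String(repeating: "la", count: 10_000)
let sub = bigString.prefix(2)  // Substring: a view sharing bigString's storage
let small = String(sub)        // explicit conversion copies just "la"
// After the copy, nothing ties `small` to bigString's 20,000-character buffer.
print(small)  // "la"
```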

A `String` that was not its own `Substring` could be one word—a single tagged
pointer—without requiring additional allocations. A `Substring` would be a view
onto a `String`, so it would be three words: a pointer to the owner, a pointer
to the start, and a length. The small string optimization for `Substring` would
take advantage of the larger size, probably with a less compressed encoding for
speed.

The downside of having two types is the inconvenience of sometimes having a
`Substring` when you need a `String`, and vice-versa. It is likely this would
be a significantly bigger problem than with `Array` and `ArraySlice`, as
slicing of `String` is such a common operation. It is especially relevant to
existing code that assumes `String` is the currency type. To ease the pain of
type mismatches, `Substring` should be a subtype of `String` in the same way
that `Int` is a subtype of `Optional<Int>`. This would give users an implicit
conversion from `Substring` to `String`, as well as the usual implicit
conversions such as `[Substring]` to `[String]` that other subtype
relationships receive.

In most cases, type inference combined with the subtype relationship should
make the type difference a non-issue and users will not care which type they
are using. For flexibility and optimizability, most operations from the
standard library will traffic in generic models of
[`Unicode`](#the--code-unicode--code--protocol).

##### Guidance for API Designers

In this model, **if a user is unsure about which type to use, `String` is always
a reasonable default**. A `Substring` passed where `String` is expected will be
implicitly copied. When compared to the “same type, copied storage” model, we
have effectively deferred the cost of copying from the point where a substring
is created until it must be converted to `String` for use with an API.

A user who needs to optimize away copies altogether should use this guideline:
if for performance reasons you are tempted to add a `Range` argument to your
method as well as a `String` to avoid unnecessary copies, you should instead
use `Substring`.

##### The “Empty Subscript”

To make it easy to call such an optimized API when you only have a `String` (or
to call any API that takes a `Collection`'s `SubSequence` when all you have is
the `Collection`), we propose the following “empty subscript” operation,

```swift
extension Collection {
  subscript() -> SubSequence {
    return self[startIndex..<endIndex]
  }
}
```

which allows the following usage:

```swift
funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring
```

The `[]` syntax can be offered as a fixit when needed, similar to `&` for an
`inout` argument. While it doesn't help a user convert `[String]` to
`[Substring]`, the need for such conversions is extremely rare, and they can be
done with a simple `map` (which could also be offered as a fixit):

```swift
takesAnArrayOfSubstring(arrayOfString.map { $0[] })
```

#### Other Options Considered

As we have seen, all three options above have downsides, but it's possible
these downsides could be eliminated/mitigated by the compiler. We are proposing
one such mitigation—implicit conversion—as part of the "different type,
shared storage" option, to help avoid the cognitive load on developers of
having to deal with a separate `Substring` type.

To avoid the memory leak issues of a "same type, shared storage" substring
option, we considered whether the compiler could perform an implicit copy of
the underlying storage when it detects the string is being "stored" for long
term usage, say when it is assigned to a stored property. The trouble with this
approach is it is very difficult for the compiler to distinguish between
long-term storage versus short-term in the case of abstractions that rely on
stored properties. For example, should the storing of a substring inside an
`Optional` be considered long-term? Or the storing of multiple substrings
inside an array? The latter would not work well in the case of a
`components(separatedBy:)` implementation that intended to return an array of
substrings. It would also be difficult to distinguish intentional medium-term
storage of substrings, say by a lexer. There does not appear to be an effective
consistent rule that could be applied in the general case for detecting when a
substring is truly being stored long-term.

To avoid the cost of copying substrings under "same type, copied storage", the
optimizer could be enhanced to reduce the impact of some of those copies.
For example, this code could be optimized to pull the invariant substring out
of the loop:

```swift
for _ in 0..<lots {
  someFunc(takingString: bigString[bigRange])
}
```

It's worth noting that a similar optimization is needed to avoid an equivalent
problem with implicit conversion in the "different type, shared storage" case:

```swift
let substring = bigString[bigRange]
for _ in 0..<lots { someFunc(takingString: substring) }
```

However, in the case of "same type, copied storage" there are many use cases
that cannot be optimized as easily. Consider the following simple definition of
a recursive `contains` algorithm which, when substring slicing is linear, makes
the overall algorithm quadratic:

```swift
extension String {
    func containsChar(_ x: Character) -> Bool {
        return !isEmpty && (first == x || dropFirst().containsChar(x))
    }
}
```

It is unrealistic to expect the optimizer to eliminate this problem, so the
user must remember to rewrite the code without string slicing if they want it
to be efficient (assuming they notice the issue at all):

```swift
extension String {
    // add optional argument tracking progress through the string
    func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
        let idx = idx ?? startIndex
        return idx != endIndex
            && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
    }
}
```

#### Substrings, Ranges and Objective-C Interop

The pattern of passing a string/range pair is common in several Objective-C
APIs, and is made especially awkward in Swift by the non-interchangeability of
`Range<String.Index>` and `NSRange`.

```swift
s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))
```

In general, however, the Swift idiom for operating on a sub-range of a
`Collection` is to *slice* the collection and operate on that:

s2.find(s2[j..<s2.endIndex])

Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported
without the `NSRange` argument. The Objective-C importer should be changed to
give these APIs special treatment so that when a `Substring` is passed, instead
of being converted to a `String`, the full `NSString` and range are passed to
the Objective-C method, thereby avoiding a copy.

As a result, you would never need to pass an `NSRange` to these APIs, which
solves the impedance problem by eliminating the argument, resulting in more
idiomatic Swift code while retaining the performance benefit. To help users
manually handle any cases that remain, Foundation should be augmented to allow
the following syntax for converting to and from `NSRange`:

let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
let iToJ = Range(nsr, in: s)    // Equivalent to i..<j
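
A sketch of the round trip these initializers would enable; note that `NSRange` counts UTF-16 code units, so its length need not equal the `Character` count:

```swift
import Foundation

let s = "café time"
let i = s.startIndex
let j = s.index(i, offsetBy: 4)   // just past "café"

// NSRange measures UTF-16 code units, so the conversion must consult s.
let nsr = NSRange(i..<j, in: s)

// The inverse is failable: an arbitrary NSRange need not describe a
// valid range of positions in s.
if let back = Range(nsr, in: s) {
    assert(back == i..<j)
}
```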

### The `Unicode` protocol

With `Substring` and `String` being distinct types and sharing almost all
interface and semantics, and with the highest-performance string processing
requiring knowledge of encoding and layout that the currency types can't
provide, it becomes important to capture the common “string API” in a protocol.
Since Unicode conformance is a key feature of string processing in Swift, we
call that protocol `Unicode`:

**Note:** The following assumes several features that are planned but not yet implemented in
  Swift, and should be considered a sketch rather than a final design.
  

protocol Unicode 
  : Comparable, BidirectionalCollection where Element == Character {
  
  associatedtype Encoding : UnicodeEncoding
  var encoding: Encoding { get }
  
  associatedtype CodeUnits 
    : RandomAccessCollection where Element == Encoding.CodeUnit
  var codeUnits: CodeUnits { get }
  
  associatedtype UnicodeScalars 
    : BidirectionalCollection  where Element == UnicodeScalar
  var unicodeScalars: UnicodeScalars { get }

  associatedtype ExtendedASCII 
    : BidirectionalCollection where Element == UInt32
  var extendedASCII: ExtendedASCII { get }
}

extension Unicode {
  // ... define high-level non-mutating string operations, e.g. search ...

  func compared<Other: Unicode>(
    to rhs: Other,
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> SortOrder { ... }
}

extension Unicode : RangeReplaceableCollection where CodeUnits :
  RangeReplaceableCollection {
    // Satisfy protocol requirement
    mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C) 
      where C.Element == Element
  
  // ... define high-level mutating string operations, e.g. replace ...
}

The goal is that `Unicode` exposes the underlying encoding and code units in
such a way that for types with a known representation (e.g. a high-performance
`UTF8String`) that information can be known at compile-time and can be used to
generate a single path, while still allowing types like `String` that admit
multiple representations to use runtime queries and branches to fast path
specializations.

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.

### Scanning, Matching, and Tokenization

#### Low-Level Textual Analysis

We should provide convenient APIs for processing strings by character. For example,
it should be easy to cleanly express, “if this string starts with `"f"`, process
the rest of the string as follows…” Swift is well-suited to expressing this
common pattern beautifully, but we need to add the APIs. Here are two examples
of the sort of code that might be possible given such APIs:

if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
  somethingWith(input) // process the rest of input
}

if let (number, restOfInput) = input.parsingPrefix(Int.self) {
   ...
}

The specific spelling and functionality of APIs like this are TBD. The larger
point is to make sure matching-and-consuming jobs are well-supported.
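
Pending that design work, a minimal version of the first example can be sketched against today's API; the name `droppingPrefix` and its predicate-based spelling here are hypothetical:

```swift
extension String {
    /// If the string starts with a character satisfying `predicate`,
    /// returns the remainder; otherwise returns nil.
    /// (Hypothetical helper; the real spelling is TBD.)
    func droppingPrefix(_ predicate: (Character) -> Bool) -> Substring? {
        guard let head = first, predicate(head) else { return nil }
        return dropFirst()
    }
}

if let rest = "function".droppingPrefix({ $0 == "f" }) {
    print(rest) // prints "unction"
}
```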

#### Unified Pattern Matcher Protocol

Many of the current methods that do matching are overloaded to do the same
logical operations in different ways, with the following axes:

- Logical Operation: `find`, `split`, `replace`, match at start
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
  the method name, and sometimes an argument
- Whole string or subrange.

We should represent these aspects as orthogonal, composable components,
abstracting pattern matchers into a protocol like
[this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
that can allow us to define logical operations once, without introducing
overloads, and massively reducing API surface area.

For example, using the strawman prefix `%` syntax to turn string literals into
patterns, the following pairs would all invoke the same generic methods:

if let found = s.firstMatch(%"searchString") { ... }
if let found = s.firstMatch(someRegex) { ... }

for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
for m in s.allMatches(someRegex) { ... }

let items = s.split(separatedBy: ", ")
let tokens = s.split(separatedBy: CharacterSet.whitespace)

Note that, because Swift requires the indices of a slice to match the indices of
the range from which it was sliced, operations like `firstMatch` can return a
`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in
the string being searched, if needed, can easily be recovered as the
`startIndex` and `endIndex` of the `Substring`.
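
This guarantee already holds for hand-made slices, so the convention can be demonstrated today, with `index(of:)` standing in for the proposed `firstMatch`:

```swift
let s = "name: value"
if let colon = s.index(of: ":") {
    let match = s[s.startIndex..<colon]   // a Substring: "name"
    // The slice's indices are positions in s itself, so the range of
    // the match comes straight from the Substring:
    let range = match.startIndex..<match.endIndex
    assert(String(s[range]) == "name")
    assert(range.lowerBound == s.startIndex)
}
```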

Note also that matching operations are useful for collections in general, and
would fall out of this proposal:

// replace subsequences of contiguous NaNs with zero
forces.replace(oneOrMore([Float.nan]), [0.0])

#### Regular Expressions

Addressing regular expressions is out of scope for this proposal.
That said, it is important to note that the pattern matching protocol mentioned
above provides a suitable foundation for regular expressions, and that types such
as `NSRegularExpression` can easily be retrofitted to conform to it. In the
future, support for regular expression literals in the compiler could allow for
compile-time syntax checking and optimization.

### String Indices

`String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and
`utf16`—each with its own opaque index type. The APIs used to translate indices
between views add needless complexity, and the opacity of indices makes them
difficult to serialize.

The index translation problem has two aspects:

  1. `String` views cannot consume one another's indices without a cumbersome
    conversion step. An index into a `String`'s `characters` must be translated
    before it can be used as a position in its `unicodeScalars`. Although these
    translations are rarely needed, they add conceptual and API complexity.
  2. Many APIs in the core libraries and other frameworks still expose `String`
    positions as `Int`s and regions as `NSRange`s, which can only reference a
    `utf16` view and interoperate poorly with `String` itself.

#### Index Interchange Among Views

String's need for flexible backing storage and reasonably-efficient indexing
(i.e. without dynamically allocating and reference-counting the indices
themselves) means indices need an efficient underlying storage type. Although
we do not wish to expose `String`'s indices *as* integers, `Int` offsets into
underlying code unit storage makes a good underlying storage type, provided
`String`'s underlying storage supports random-access. We think random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Making these `Int` code unit offsets conveniently accessible and constructible
solves the serialization problem:

clipboard.write(s.endIndex.codeUnitOffset)
let offset = clipboard.read(Int.self)
let i = String.Index(codeUnitOffset: offset)

Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
seamless by having them share an index type (semantics of indexing a `String`
between grapheme cluster boundaries are TBD—it can either trap or be forgiving).
Having a common index allows easy traversal into the interior of graphemes,
something that is often needed, without making it likely that someone will do it
by accident.

- `String.index(after:)` should advance to the next grapheme, even when the
   index points partway through a grapheme.
   
- `String.index(before:)` should move to the start of the grapheme before
   the current position.
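
To make these rules concrete, here is a hypothetical walk-through, assuming the shared index type sketched above (none of this behavior is implemented today):

```swift
// "👩‍👩‍👧" is a single Character composed of several Unicode scalars
// joined by zero-width joiners.
let s = "👩‍👩‍👧!"

// With a shared index type, the unicodeScalars view can hand back a
// position in the interior of the first grapheme:
let interior = s.unicodeScalars.index(after: s.startIndex)

// Per the rules above, index(after:) should land on the next whole
// grapheme (the "!") even starting from a mid-grapheme position, and
// index(before:) should snap back to that grapheme's start.
let next = s.index(after: interior)
let start = s.index(before: next)
```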

Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not
crucial, as the specifics of encoding should not be a concern for most use
cases, and would impose needless costs on the indices of other views. That
said, we can make translation much more straightforward by exposing simple
bidirectional converting `init`s on both index types:

let u8Position = String.UTF8.Index(someStringIndex)
let originalPosition = String.Index(u8Position)

#### Index Interchange with Cocoa

We intend to address `NSRange`s that denote substrings in Cocoa APIs as
described [later in this document](#substrings--ranges-and-objective-c-interop).
That leaves the interchange of bare indices with Cocoa APIs trafficking in
`Int`. Hopefully such APIs will be rare, but when needed, the following
extension, which would be useful for all `Collections`, can help:

extension Collection {
  func index(offset: IndexDistance) -> Index {
    return index(startIndex, offsetBy: offset)
  }
  func offset(of i: Index) -> IndexDistance {
    return distance(from: startIndex, to: i)
  }
}

Then integers can easily be translated into offsets into a `String`'s `utf16`
view for consumption by Cocoa:

let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
let swiftIndex = s.utf16.index(offset: cocoaIndex)

### Formatting

A full treatment of formatting is out of scope of this proposal, but
we believe it's crucial for completing the text processing picture. This
section details some of the existing issues and thinking that may guide future
development.

#### Printf-Style Formatting

`String.format` is designed on the `printf` model: it takes a format string with
textual placeholders for substitution, and an arbitrary list of other arguments.
The syntax and meaning of these placeholders has a long history in
C, but for anyone who doesn't use them regularly they are cryptic and complex,
as the `printf (3)` man page attests.

Aside from complexity, this style of API has two major problems: First, the
spelling of these placeholders must match up to the types of the arguments, in
the right order, or the behavior is undefined. Some limited support for
compile-time checking of this correspondence could be implemented, but only for
the cases where the format string is a literal. Second, there's no reasonable
way to extend the formatting vocabulary to cover the needs of new types: you are
stuck with what's in the box.

#### Foundation Formatters

The formatters supplied by Foundation are highly capable and versatile, offering
both formatting and parsing services. When used for formatting, though, the
design pattern demands more from users than it should:

  * Matching the type of data being formatted to a formatter type
  * Creating an instance of that type
  * Setting stateful options (`currency`, `dateStyle`) on the type. Note: the
    need for this step prevents the instance from being used and discarded in
    the same expression where it is created.
  * Overall, introduction of needless verbosity into source

These may seem like small issues, but the experience of Apple localization
experts is that the total drag of these factors on programmers is such that they
tend to reach for `String.format` instead.

#### String Interpolation

Swift string interpolation provides a user-friendly alternative to printf's
domain-specific language (just write ordinary swift code!) and its type safety
problems (put the data right where it belongs!) but the following issues prevent
it from being useful for localized formatting (among other jobs):

  * [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
    types used in string interpolation.
  * [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
    distinguish (fragments of) the base string from the string substitutions.

In the long run, we should improve Swift string interpolation to the point where
it can participate in most any formatting job. Mostly this centers around
fixing the interpolation protocols per the previous item, and supporting
localization.

To be able to use formatting effectively inside interpolations, it needs to be
both lightweight (because it all happens in-situ) and discoverable. One
approach would be to standardize on `format` methods, e.g.:

"Column 1: \(n.format(radix:16, width:8)) *** \(message)"

"Something with leading zeroes: \(x.format(fill: zero, width:8))"

### C String Interop

Our support for interoperation with nul-terminated C strings is scattered and
incoherent, with six ways to transform a C string into a `String` and four ways
to do the inverse. These APIs should be replaced with the following:

extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded 
  ///   bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)
  
  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  ///   the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)
    
  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}

In both of the construction APIs, any invalid encoding sequence detected will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per Unicode specification. This covers the common case. The
replacement is done *physically* in the underlying storage, and the validity of
the result is recorded in the `String`'s `encoding`, so that future accesses
need not be slowed down by separate error repair.

Construction that is aborted when encoding errors are detected can be
accomplished using APIs on the `encoding`. String types that retain their
physical encoding even in the presence of errors and are repaired on-the-fly can
be built as different instances of the `Unicode` protocol.
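
For illustration, the UTF-8 initializer is close in shape to the `String(cString:)` that exists today, which already performs U+FFFD repair of invalid input:

```swift
let greeting = "hej"
let copy = greeting.withCString { String(cString: $0) }
assert(copy == greeting)

// An invalid UTF-8 byte is repaired to U+FFFD rather than rejected:
let bad: [CChar] = [104, CChar(bitPattern: 0xFF), 0] // 'h', bad byte, NUL
assert(String(cString: bad) == "h\u{FFFD}")
```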

### Unicode 9 Conformance

Unicode 9 (and macOS 10.11) brought us support for family emoji, which changes
the process of properly identifying `Character` boundaries. We need to update
`String` to account for this change.

### High-Performance String Processing

Many strings are short enough to store in 64 bits, many can be stored using only
8 bits per unicode scalar, others are best encoded in UTF-16, and some come to
us already in some other encoding, such as UTF-8, that would be costly to
translate. Supporting these formats while maintaining usability for
general-purpose APIs demands that a single `String` type can be backed by many
different representations.

That said, the highest performance code always requires static knowledge of the
data structures on which it operates, and for this code, dynamic selection of
representation comes at too high a cost. Heavy-duty text processing demands a
way to opt out of dynamism and directly use known encodings. Having this
ability can also make it easy to cleanly specialize code that handles dynamic
cases for maximal efficiency on the most common representations.

To address this need, we can build models of the `Unicode` protocol that encode
representation information into the type, such as `NFCNormalizedUTF16String`.

### Parsing ASCII Structure

Although many machine-readable formats support the inclusion of arbitrary
Unicode text, it is also common that their fundamental structure lies entirely
within the ASCII subset (JSON, YAML, many XML formats). These formats are often
processed most efficiently by recognizing ASCII structural elements as ASCII,
and capturing the arbitrary sections between them in more-general strings. The
current String API offers no way to efficiently recognize ASCII and skip past
everything else without the overhead of full decoding into unicode scalars.

For these purposes, strings should supply an `extendedASCII` view that is a
collection of `UInt32`, where values less than `0x80` represent the
corresponding ASCII character, and other values represent data that is specific
to the underlying encoding of the string.
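
A sketch of how a parser might use such a view; `json`, `handleStructural`, and the `extendedASCII` view itself are all hypothetical here:

```swift
// ASCII code points for the JSON structural characters { } [ ] : ,
let structural: Set<UInt32> = [0x7B, 0x7D, 0x5B, 0x5D, 0x3A, 0x2C]

for (offset, unit) in json.extendedASCII.enumerated() {
    // Values below 0x80 are guaranteed to be real ASCII; everything
    // else is encoding-specific data we can skip without decoding.
    if unit < 0x80 && structural.contains(unit) {
        handleStructural(unit, at: offset)
    }
}
```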

## Language Support

This proposal depends on two new features in the Swift language:

1. **Generic subscripts**, to
   enable unified slicing syntax.

2. **A subtype relationship** between
   `Substring` and `String`, enabling framework APIs to traffic solely in
   `String` while still making it possible to avoid copies by handling
   `Substring`s where necessary.

Additionally, **the ability to nest types and protocols inside
protocols** could significantly shrink the footprint of this proposal
on the top-level Swift namespace.

## Open Questions

### Must `String` be limited to storing UTF-16 subset encodings?

- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
  question here; this is about what encodings must be storable, without
  transcoding, in the common currency type called “`String`”.
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
- If we have a way to get at a `String`'s code units, we need a concrete type in
  which to express them in the API of `String`, which is a concrete type
- If String needs to be able to represent UTF-32, presumably the code units need
  to be `UInt32`.
- Not supporting UTF-32-encoded text seems like one reasonable design choice.
- Maybe we can allow UTF-8 storage in `String` and expose its code units as
  `UInt16`, just as we would for Latin-1.
- Supporting only UTF-16-subset encodings would imply that `String` indices can
  be serialized without recording the `String`'s underlying encoding.

### Do we need a type-erasable base protocol for UnicodeEncoding?

UnicodeEncoding has an associated type, but it may be important to be able to
traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”

### Should there be a string “facade?”

One possible design alternative makes `Unicode` a vehicle for expressing
the storage and encoding of code units, but does not attempt to give it an API
appropriate for `String`. Instead, string APIs would be provided by a generic
wrapper around an instance of `Unicode`:

struct StringFacade<U: Unicode> : BidirectionalCollection {

  // ...APIs for high-level string processing here...
  
  var unicode: U // access to lower-level unicode details
}

typealias String = StringFacade<StringStorage>
typealias Substring = StringFacade<StringStorage.SubSequence>

This design would allow us to de-emphasize lower-level `String` APIs such as
access to the specific encoding, by putting them behind a `.unicode` property.
A similar effect in a facade-less design would require a new top-level
`StringProtocol` playing the role of the facade with an `associatedtype
Storage : Unicode`.

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

struct String<U: Unicode = StringStorage> 
  : BidirectionalCollection {

  // ...APIs for high-level string processing here...
  
  var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>

One advantage of such a design is that naïve users will always extend “the right
type” (`String`) without thinking, and the new APIs will show up on `Substring`,
`MyUTF8String`, etc. That said, it also has downsides that should not be
overlooked, not least of which is the confusability of the meaning of the word
“string.” Is it referring to the generic or the concrete type?

### `TextOutputStream` and `TextOutputStreamable`

`TextOutputStreamable` is intended to provide a vehicle for
efficiently transporting formatted representations to an output stream
without forcing the allocation of storage. Its use of `String`, a
type with multiple representations, at the lowest-level unit of
communication, conflicts with this goal. It might be sufficient to
change `TextOutputStream` and `TextOutputStreamable` to traffic in an
associated type conforming to `Unicode`, but that is not yet clear.
This area will require some design work.

### `description` and `debugDescription`

* Should these be creating localized or non-localized representations?
* Is returning a `String` efficient enough?
* Is `debugDescription` pulling the weight of the API surface area it adds?

### `StaticString`

`StaticString` was added as a byproduct of standard library development and kept
around because it seemed useful, but it was never truly *designed* for client
programmers. We need to decide what happens with it. Presumably *something*
should fill its role, and that should conform to `Unicode`.

## Footnotes

<b id="f0">0</b> The integers rewrite currently underway is expected to
    substantially reduce the scope of `Int`'s API by using more
    generics. [:leftwards_arrow_with_hook:](#a0)

<b id="f1">1</b> In practice, these semantics will usually be tied to the
version of the installed [ICU](http://icu-project.org) library, which
programmatically encodes the most complex rules of the Unicode Standard and its
de-facto extension, CLDR.[:leftwards_arrow_with_hook:](#a1)

<b id="f2">2</b>
See
[http://unicode.org/reports/tr29/#Notation](http://unicode.org/reports/tr29/#Notation). Note
that inserting Unicode scalar values to prevent merging of grapheme clusters would
also constitute a kind of misbehavior (one of the clusters at the boundary would
not be found in the result), so would be relatively costly to implement, with
little benefit. [:leftwards_arrow_with_hook:](#a2)

<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by
  the Unicode standard for this purpose. In fact there's
  a [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
  dedicated to it. In particular, §5.17 says:

  > When comparing text that is visible to end users, a correct linguistic sort
  > should be used, as described in _Section 5.16, Sorting and
  > Searching_. However, in many circumstances the only requirement is for a
  > fast, well-defined ordering. In such cases, a binary ordering can be used.

  [:leftwards_arrow_with_hook:](#a4)

<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto
properties in a table that's indexed by unicode scalar value. This table is
part of the Unicode standard. Some of these queries (e.g., “is this an
uppercase character?”) may have fairly obvious generalizations to grapheme
clusters, but exactly how to do it is a research topic and *ideally* we'd either
establish the existing practice that the Unicode committee would standardize, or
the Unicode committee would do the research and we'd implement their
result.[:leftwards_arrow_with_hook:](#a5)


(Saagar Jha) #2

Looks pretty good in general from my quick glance–at least, it’s much better than the current situation. I do have a couple of comments and questions, which I’ve inlined below.

Saagar Jha

It's worth noting that ergonomics and correctness are mutually-reinforcing. An
API that is easy to use—but incorrectly—cannot be considered an ergonomic
success. Conversely, an API that's simply hard to use is also hard to use
correctly. Acheiving optimal performance without compromising ergonomics or
correctness is a greater challenge.

Minor typo: acheiving->achieving

Consistency with the Swift language and idioms is also important for
ergonomics. There are several places both in the standard library and in the
foundation additions to `String` where patterns and practices found elsewhere
could be applied to improve usability and familiarity.

### API Surface Area

Primary data types such as `String` should have APIs that are easily understood
given a signature and a one-line summary. Today, `String` fails that test. As
you can see, the Standard Library and Foundation both contribute significantly to
its overall complexity.

**Method Arity** | **Standard Library** | **Foundation**
---|:---:|:---:
0: `ƒ()` | 5 | 7
1: `ƒ(:)` | 19 | 48
2: `ƒ(::)` | 13 | 19
3: `ƒ(:::)` | 5 | 11
4: `ƒ(::::)` | 1 | 7
5: `ƒ(:::::)` | - | 2
6: `ƒ(::::::)` | - | 1

**API Kind** | **Standard Library** | **Foundation**
---|:---:|:---:
`init` | 41 | 18
`func` | 42 | 55
`subscript` | 9 | 0
`var` | 26 | 14

**Total: 205 APIs**

By contrast, `Int` has 80 APIs, none with more than two parameters.[0] String
processing is complex enough; users shouldn't have to wade through API sprawl
just to get started.

Many of the choices detailed below contribute to solving this problem,
including:

* Restoring `Collection` conformance and dropping the `.characters` view.
* Providing a more general, composable slicing syntax.
* Altering `Comparable` so that parameterized
   (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
* Clearly separating language-dependent operations on text produced
   by and for humans from language-independent
   operations on text produced by and for machine processing.
* Relocating APIs that fall outside the domain of basic string processing and
   discouraging the proliferation of ad-hoc extensions.

### Batteries Included

While `String` is available to all programs out-of-the-box, crucial APIs for
basic string processing tasks are still inaccessible until `Foundation` is
imported. While it makes sense that `Foundation` is needed for domain-specific
jobs such as
[linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
one should not need to import anything to, for example, do case-insensitive
comparison.

### Unicode Compliance and Platform Support

The Unicode standard provides a crucial objective reference point for what
constitutes correct behavior in an extremely complex domain, so
Unicode-correctness is, and will remain, a fundamental design principle behind
Swift's `String`. That said, the Unicode standard is an evolving document, so
this objective reference-point is not fixed.[1] While
many of the most important operations—e.g. string hashing, equality, and
non-localized comparison—will be stable, the semantics
of others, such as grapheme breaking and localized comparison and case
conversion, are expected to change as platforms are updated, so programs should
be written so their correctness does not depend on precise stability of these
semantics across OS versions or platforms. Although it may be possible to
imagine static and/or dynamic analysis tools that will help users find such
errors, the only sure way to deal with this fact of life is to educate users.

## Design Points

### Internationalization

There is strong evidence that developers cannot determine how to use
internationalization APIs correctly. Although documentation could and should be
improved, the sheer size, complexity, and diversity of these APIs is a major
contributor to the problem, causing novices to tune out, and more experienced
programmers to make avoidable mistakes.

The first step in improving this situation is to regularize all localized
operations as invocations of normal string operations with extra
parameters. Among other things, this means:

1. Doing away with `localizedXXX` methods
2. Providing a terse way to name the current locale as a parameter
3. Automatically adjusting defaults for options such
  as case sensitivity based on whether the operation is localized.
4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
   guidance in the
   [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).

Along with appropriate documentation updates, these changes will make localized
operations more teachable, comprehensible, and approachable, thereby lowering a
barrier that currently leads some developers to ignore localization issues
altogether.

#### The Default Behavior of `String`

Although this isn't well-known, the most accessible form of many operations on
Swift `String` (and `NSString`) are really only appropriate for text that is
intended to be processed for, and consumed by, machines. The semantics of the
operations with the simplest spellings are always non-localized and
language-agnostic.

Two major factors play into this design choice:

1. Machine processing of text is important, so we should have first-class,
  accessible functions appropriate to that use case.

2. The most general localized operations require a locale parameter not required
  by their un-localized counterparts. This naturally skews complexity towards
  localized operations.

Reaffirming that `String`'s simplest APIs have
language-independent/machine-processed semantics has the benefit of clarifying
the proper default behavior of operations such as comparison, and allows us to
make [significant optimizations](#collation-semantics) that were previously
thought to conflict with Unicode.

#### Future Directions

One of the most common internationalization errors is the unintentional
presentation to users of text that has not been localized, but regularizing APIs
and improving documentation can go only so far in preventing this error.
Combined with the fact that `String` operations are non-localized by default,
the environment for processing human-readable text may still be somewhat
error-prone in Swift 4.

For an audience of mostly non-experts, it is especially important that naïve
code is very likely to be correct if it compiles, and that more sophisticated
issues can be revealed progressively. For this reason, we intend to
specifically and separately target localization and internationalization
problems in the Swift 5 timeframe.

### Operations With Options

There are three categories of string operations that commonly need to be
tuned along various dimensions:

**Operation**|**Applicable Options**
---|---
sort ordering | locale, case/diacritic/width-insensitivity
case conversion | locale
pattern matching | locale, case/diacritic/width-insensitivity

The defaults for case-, diacritic-, and width-insensitivity are different for
localized operations than for non-localized operations, so for example a
localized sort should be case-insensitive by default, and a non-localized sort
should be case-sensitive by default. We propose a standard “language” of
defaulted parameters to be used for these purposes, with usage roughly like this:

    x.compared(to: y, case: .sensitive, in: swissGerman)

    x.lowercased(in: .currentLocale)

    x.allMatches(
      somePattern, case: .insensitive, diacritic: .insensitive)

This usage might be supported by code like this:

    enum StringSensitivity {
      case sensitive
      case insensitive
    }

    extension Locale {
      static var currentLocale: Locale { ... }
    }

    extension Unicode {
      // An example of the option language in declaration context,
      // with nil defaults indicating unspecified, so defaults can be
      // driven by the presence/absence of a specific Locale
      func frobnicated(
        case caseSensitivity: StringSensitivity? = nil,
        diacritic diacriticSensitivity: StringSensitivity? = nil,
        width widthSensitivity: StringSensitivity? = nil,
        in locale: Locale? = nil
      ) -> Self { ... }
    }

### Comparing and Hashing Strings

#### Collation Semantics

What Unicode says about collation—which is used in `<`, `==`, and hashing—turns
out to be quite interesting, once you pick it apart. The full Unicode Collation
Algorithm (UCA) works like this:

1. Fully normalize both strings
2. Convert each string to a sequence of numeric triples to form a collation key
3. “Flatten” the key by concatenating the sequence of first elements to the
  sequence of second elements to the sequence of third elements
4. Lexicographically compare the flattened keys

While step 1 can usually
be [done quickly](http://unicode.org/reports/tr15/#Description_Norm) and
incrementally, step 2 uses a collation table that maps matching *sequences* of
unicode scalars in the normalized string to *sequences* of triples, which get
accumulated into a collation key. Predictably, this is where the real costs
lie.
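
As an illustration only (this is not ICU's implementation, and the weight
values below are made up; real ones come from the Unicode collation tables),
steps 2–4 can be sketched like this:

```swift
// Illustrative sketch of UCA steps 2-4. Weight values are invented;
// real weights come from the DUCET collation table via ICU.
typealias CollationElement = (primary: UInt16, secondary: UInt16, tertiary: UInt16)

// Step 3: concatenate all primary weights, then all secondary weights,
// then all tertiary weights, dropping zero (ignorable) weights.
func flattenedKey(_ elements: [CollationElement]) -> [UInt16] {
    return elements.map { $0.primary }.filter { $0 != 0 }
         + elements.map { $0.secondary }.filter { $0 != 0 }
         + elements.map { $0.tertiary }.filter { $0 != 0 }
}

// Step 4: lexicographic comparison of the flattened keys.
let x = flattenedKey([(primary: 0x29, secondary: 0x20, tertiary: 0x02),
                      (primary: 0x2A, secondary: 0x20, tertiary: 0x02)])
let y = flattenedKey([(primary: 0x29, secondary: 0x20, tertiary: 0x08),
                      (primary: 0x2A, secondary: 0x20, tertiary: 0x02)])
print(x.lexicographicallyPrecedes(y)) // true — differs only at the tertiary level
```

Flattening level-by-level is what makes primary differences (base letters)
always outweigh secondary (diacritics) and tertiary (case) differences.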

*However*, there are some bright spots to this story. First, as it turns out,
string sorting (localized or not) should be done down to what's called
the
[“identical” level](http://unicode.org/reports/tr10/#Multi_Level_Comparison),
which adds a step 3a: append the string's normalized form to the flattened
collation key. At first blush this just adds work, but consider what it does
for equality: two strings that normalize the same, naturally, will collate the
same. But also, *strings that normalize differently will always collate
differently*. In other words, for equality, it is sufficient to compare the
strings' normalized forms and see if they are the same. We can therefore
entirely skip the expensive part of collation for equality comparison.

Next, naturally, anything that applies to equality also applies to hashing: it
is sufficient to hash the string's normalized form, bypassing collation keys.
This should provide significant speedups over the current implementation.
Perhaps more importantly, since comparison down to the “identical” level applies
even to localized strings, it means that hashing and equality can be implemented
exactly the same way for localized and non-localized text, and hash tables with
localized keys will remain valid across current-locale changes.
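
A sketch of the idea, using Foundation's canonical-composition normalizer as a
stand-in for whichever normal form the implementation ultimately uses:

```swift
import Foundation

// Equality (and, by extension, hashing) via normalization alone,
// bypassing collation keys. precomposedStringWithCanonicalMapping
// produces the canonically-composed (NFC) form of the string.
func canonicallyEqual(_ a: String, _ b: String) -> Bool {
    return a.precomposedStringWithCanonicalMapping
        == b.precomposedStringWithCanonicalMapping
}

let composed   = "\u{00E9}"     // "é" as one scalar
let decomposed = "e\u{0301}"    // "e" + combining acute accent

print(composed.unicodeScalars.count, decomposed.unicodeScalars.count) // 1 2
print(canonicallyEqual(composed, decomposed))                         // true
```

The two inputs have different scalar sequences, yet normalize—and therefore
compare and hash—identically, with no collation table in sight.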

Finally, once it is agreed that the *default* role for `String` is to handle
machine-generated and machine-readable text, the default ordering of `String`s
need no longer use the UCA at all. It is sufficient to order them in any way
that's consistent with equality, so `String` ordering can simply be a
lexicographical comparison of normalized forms[4]
(which is equivalent to lexicographically comparing the sequences of grapheme
clusters), again bypassing step 2 and offering another speedup.

This leaves us executing the full UCA *only* for localized sorting, and ICU's
implementation has apparently been very well optimized.

Following this scheme everywhere would also allow us to make sorting behavior
consistent across platforms. Currently, we sort `String` according to the UCA,
except that—*only on Apple platforms*—pairs of ASCII characters are ordered by
unicode scalar value.

#### Syntax

Because the current `Comparable` protocol expresses all comparisons with binary
operators, string comparisons—which may require
additional [options](#operations-with-options)—do not fit smoothly into the
existing syntax. At the same time, we'd like to solve other problems with
comparison, as outlined
in
[this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
(implemented by changes at the head
of
[this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
We should adopt a modification of that proposal that uses a method rather than
an operator `<=>`:

    enum SortOrder { case before, same, after }

    protocol Comparable : Equatable {
      func compared(to: Self) -> SortOrder
      ...
    }

This change will give us a syntactic platform on which to implement methods with
additional, defaulted arguments, thereby unifying and regularizing comparison
across the library.

    extension String {
      func compared(to: Self) -> SortOrder
    }
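
To see how a single `compared(to:)` method can unify comparison, here is a
sketch of deriving the binary operators from it once, using a stand-in
`Ordered` protocol so as not to collide with the existing `Comparable`:

```swift
enum SortOrder { case before, same, after }

// Stand-in for the modified Comparable; named Ordered here only to
// avoid clashing with the existing standard-library protocol.
protocol Ordered {
    func compared(to other: Self) -> SortOrder
}

extension Ordered {
    // The operators are written once, in terms of compared(to:).
    static func < (lhs: Self, rhs: Self) -> Bool {
        return lhs.compared(to: rhs) == .before
    }
    static func == (lhs: Self, rhs: Self) -> Bool {
        return lhs.compared(to: rhs) == .same
    }
}

struct Version: Ordered {
    var major: Int
    func compared(to other: Version) -> SortOrder {
        if major < other.major { return .before }
        if major > other.major { return .after }
        return .same
    }
}

print(Version(major: 1) < Version(major: 2)) // true
```

Conforming types implement one method, and every comparison—operators as well
as option-taking variants—hangs off that single primitive.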

**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible
that the standard library simply adopts Foundation's `ComparisonResult` as is,
but we believe the community should at least consider alternate naming before
that happens. There will be an opportunity to discuss the choices in detail
when the modified
[Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes
up for review.

### `String` should be a `Collection` of `Character`s Again

In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
convinced ourselves that its semantics differed from those of `Collection` too
significantly.

It was always well understood that if strings were treated as sequences of
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`,
and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was
a collection of `Character` (extended grapheme clusters). During 2.0
development, though, we realized that correct string concatenation could
occasionally merge distinct grapheme clusters at the start and end of combined
strings.

This quirk aside, every aspect of strings-as-collections-of-graphemes appears to
comport perfectly with Unicode. We think the concatenation problem is tolerable,
because the cases where it occurs all represent partially-formed constructs. The
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE
ACCENT)—are explicitly called out in the Unicode standard as
“[degenerate](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)” or
“[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The other
cases—such as a string ending in a zero-width joiner or half of a regional
indicator—appear to be equally transient and unlikely outside of a text editor.

Admitting these cases encourages exploration of grapheme composition and is
consistent with what appears to be an overall Unicode philosophy that “no
special provisions are made to get marginally better behavior for… cases that
never occur in practice.”[2] Furthermore, it seems
unlikely to disturb the semantics of any plausible algorithms. We can handle
these cases by documenting them, explicitly stating that the elements of a
`String` are an emergent property based on Unicode rules.

The benefits of restoring `Collection` conformance are substantial:

* Collection-like operations encourage experimentation with strings to
   investigate and understand their behavior. This is useful for teaching new
   programmers, but also good for experienced programmers who want to
   understand more about strings/unicode.

* Extended grapheme clusters form a natural element boundary for Unicode
   strings. For example, searching and matching operations will always produce
   results that line up on grapheme cluster boundaries.

* Character-by-character processing is a legitimate thing to do in many real
   use-cases, including parsing, pattern matching, and language-specific
   transformations such as transliteration.

* `Collection` conformance makes a wide variety of powerful operations
   available that are appropriate to `String`'s default role as the vehicle for
   machine processed text.

   The methods `String` would inherit from `Collection`, where they parallel
   higher-level string algorithms, have the right semantics. For example,
   grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of
   `flatMap` with case-conversion, produce the same results one would expect
   from whole-string ordering comparison, equality comparison, and
   case-conversion, respectively. `reverse` operates correctly on graphemes,
   keeping diacritics moored to their base characters and leaving emoji intact.
   Other methods such as `indexOf` and `contains` make obvious sense. A few
   `Collection` methods, like `min` and `max`, may not be particularly useful
   on `String`, but we don't consider that to be a problem worth solving, in
   the same way that we wouldn't try to suppress `min` and `max` on a
   `Set([UInt8])` that was used to store IP addresses.

* Many of the higher-level operations that we want to provide for `String`s,
   such as parsing and pattern matching, should apply to any `Collection`, and
   many of the benefits we want for `Collections`, such
   as unified slicing, should accrue
   equally to `String`. Making `String` part of the same protocol hierarchy
   allows us to write these operations once and not worry about keeping the
   benefits in sync.

* Slicing strings into substrings is a crucial part of the vocabulary of
   string processing, and all other sliceable things are `Collection`s.
   Because of its collection-like behavior, users naturally think of `String`
   in collection terms, but run into frustrating limitations where it fails to
   conform and are left to wonder where all the differences lie. Many simply
   “correct” this limitation by declaring a trivial conformance:

      extension String : BidirectionalCollection {}

   Even if we removed indexing-by-element from `String`, users could still do
   this:

     extension String : BidirectionalCollection {
       subscript(i: Index) -> Character { return characters[i] }
     }

   It would be much better to legitimize the conformance to `Collection` and
   simply document the oddity of any concatenation corner-cases, than to deny
   users the benefits on the grounds that a few cases are confusing.

Note that the fact that `String` is a collection of graphemes does *not* mean
that string operations will necessarily have to do grapheme boundary
recognition. See the Unicode protocol section for details.
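
For a sense of what the conformance buys, this is how the
grapheme-cluster element model behaves (output as in Swift 4 and later, where
this design shipped):

```swift
// Strings as collections of Characters (extended grapheme clusters):
let cafe = "cafe\u{0301}"          // "café" with a combining acute accent
print(cafe.count)                  // 4 — the accent stays with its base "e"
print(String(cafe.reversed()))     // "éfac" — diacritics stay moored

let family = "👩‍👩‍👧‍👦"                 // several scalars joined by zero-width joiners
print(family.count)                // 1 — a single grapheme cluster
```

`elementsEqual`, `lexicographicallyPrecedes`, and friends inherit the same
grapheme-wise semantics for free.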

### `Character` and `CharacterSet`

`Character`, which represents a
Unicode
[extended grapheme cluster](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries),
is a bit of a black box, requiring conversion to `String` in order to
do any introspection, including interoperation with ASCII. To fix this, we should:

- Add a `unicodeScalars` view much like `String`'s, so that the sub-structure
  of grapheme clusters is discoverable.
- Add a failable `init` from sequences of scalars (returning nil for sequences
  that contain 0 or 2+ graphemes).
- (Lower priority) expose some operations, such as `func uppercase() ->
  String`, `var isASCII: Bool`, and, to the extent they can be sensibly
  generalized, queries of unicode properties that should also be exposed on
  `UnicodeScalar` such as `isAlphabetic` and `isGraphemeBase`.
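
Today the sub-structure is only reachable by round-tripping through `String`;
the sketch below shows what the proposed `unicodeScalars` view would expose
(later Swift releases did add `Character.unicodeScalars` directly):

```swift
let ch: Character = "e\u{0301}"    // "é" built from two scalars

// Until the proposed view exists on Character, convert through String
// to inspect the grapheme cluster's scalar sub-structure.
let scalars = Array(String(ch).unicodeScalars)
print(scalars.count)               // 2
print(scalars.map { $0.value })    // [101, 769] — 'e' and U+0301
```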

Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar`
type. This means it is usable on `String`, but only by going through the unicode
scalar view. To deal with this clash in the short term, `CharacterSet` should be
renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to
introduce a `CharacterSet` that provides similar functionality for extended
grapheme clusters.[5]

### Unification of Slicing Operations

Creating substrings is a basic part of String processing, but the slicing
operations that we have in Swift are inconsistent in both their spelling and
their naming:

* Slices with two explicit endpoints are done with subscript, and support
   in-place mutation:

       s[i..<j].mutate()

* Slicing from an index to the end, or from the start to an index, is done
   with a method and does not support in-place mutation:

       s.prefix(upTo: i).readOnly()

Prefix and suffix operations should be migrated to subscripting operations
with one-sided ranges, i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as
in
[this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
With generic subscripting in the language, that will allow us to collapse a wide
variety of methods and subscript overloads into a single implementation, and
give users an easy-to-use and composable way to describe subranges.
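
A sketch of the proposed spelling (this later shipped in Swift 4 as SE-0172,
"One-Sided Ranges"):

```swift
let s = "Hello, world"
let i = s.index(s.startIndex, offsetBy: 5)

let head = s[..<i]   // replaces s.prefix(upTo: i)
let tail = s[i...]   // replaces s.suffix(from: i)
print(head)          // "Hello"
print(tail)          // ", world"
```

Both slices are `Substring`s sharing `s`'s storage, so the subscript spelling
composes with everything else in this document's slicing model.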

Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 5)`
is an ongoing research project that can be considered part of the potential
long-term vision of text (and collection) processing.

### Substrings

When implementing substring slicing, languages are faced with three options:

1. Make the substrings the same type as string, and share storage.
2. Make the substrings the same type as string, and copy storage when making the substring.
3. Make substrings a different type, with a storage copy on conversion to string.

We think number 3 is the best choice. A walk-through of the tradeoffs follows.

#### Same type, shared storage

In Swift 3.0, slicing a `String` produces a new `String` that is a view into a
subrange of the original `String`'s storage. This is why `String` is 3 words in
size (the start, length and buffer owner), unlike the similar `Array` type
which is only one.

This is a simple model with big efficiency gains when chopping up strings into
multiple smaller strings. But it does mean that a stored substring keeps the
entire original string buffer alive even after it would normally have been
released.

This arrangement has proven to be problematic in other programming languages,
because applications sometimes extract small strings from large ones and keep
those small strings long-term. That is considered a memory leak and was enough
of a problem in Java that they changed from substrings sharing storage to
making a copy in 1.7.

#### Same type, copied storage

Copying of substrings is also the choice made in C#, and in the default
`NSString` implementation. This approach avoids the memory leak issue, but has
obvious performance overhead in performing the copies.

This in turn encourages trafficking in string/range pairs instead of in
substrings, for performance reasons, leading to API challenges. For example:

    foo.compare(bar, range: start..<end)

Here, it is not clear whether `range` applies to `foo` or `bar`. This
relationship is better expressed in Swift as a slicing operation:

    foo[start..<end].compare(bar)

Not only does this clarify to which string the range applies, it also brings
this sub-range capability to any API that operates on `String` "for free". So
these other combinations also work equally well:

    // apply range on argument rather than target
    foo.compare(bar[start..<end])
    // apply range on both
    foo[start..<end].compare(bar[start1..<end1])
    // compare two strings ignoring first character
    foo.dropFirst().compare(bar.dropFirst())

In all three cases, an explicit range argument need not appear on the `compare`
method itself. The implementation of `compare` does not need to know anything
about ranges. Methods need only take range arguments when that was an
integral part of their purpose (for example, setting the start and end of a
user's current selection in a text box).

#### Different type, shared storage

The desire to share underlying storage while preventing accidental memory leaks
occurs with slices of `Array`. For this reason we have an `ArraySlice` type.
The inconvenience of a separate type is mitigated by most operations used on
`Array` from the standard library being generic over `Sequence` or `Collection`.

We should apply the same approach for `String` by introducing a distinct
`SubSequence` type, `Substring`. The advice given for `ArraySlice` applies equally to `Substring`:

> Important: Long-term storage of `Substring` instances is discouraged. A
> substring holds a reference to the entire storage of a larger string, not
> just to the portion it presents, even after the original string's lifetime
> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
> of large strings that are no longer otherwise accessible, which can appear
> to be memory leakage.

When assigning a `Substring` to a longer-lived variable (usually a stored
property) explicitly of type `String`, a type conversion will be performed, and
at this point the substring buffer is copied and the original string's storage
can be released.
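
In code (using `split`, which under this model returns `[Substring]`), the
conversion point looks like this:

```swift
let big = String(repeating: "x", count: 1_000) + ",tail"

// sub is a Substring: it shares big's entire buffer.
let sub = big.split(separator: ",")[1]

// Storing long-term: the explicit String conversion copies just the
// slice's contents, letting the large buffer be released.
let stored = String(sub)
print(stored) // "tail"
```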

A `String` that was not its own `Substring` could be one word—a single tagged
pointer—without requiring additional allocations. `Substring`s would be a view
onto a `String`, so are 3 words: a pointer to the owner, a pointer to the
start, and a length. The small string optimization for `Substring` would take
advantage of the larger size, probably with a less compressed encoding for
speed.

The downside of having two types is the inconvenience of sometimes having a
`Substring` when you need a `String`, and vice-versa. It is likely this would
be a significantly bigger problem than with `Array` and `ArraySlice`, as
slicing of `String` is such a common operation. It is especially relevant to
existing code that assumes `String` is the currency type. To ease the pain of
type mismatches, `Substring` should be a subtype of `String` in the same way
that `Int` is a subtype of `Optional<Int>`. This would give users an implicit
conversion from `Substring` to `String`, as well as the usual implicit
conversions such as `[Substring]` to `[String]` that other subtype
relationships receive.

In most cases, type inference combined with the subtype relationship should
make the type difference a non-issue and users will not care which type they
are using. For flexibility and optimizability, most operations from the
standard library will traffic in generic models of
[`Unicode`](#the--code-unicode--code--protocol).

##### Guidance for API Designers

In this model, **if a user is unsure about which type to use, `String` is always
a reasonable default**. A `Substring` passed where `String` is expected will be
implicitly copied. When compared to the “same type, copied storage” model, we
have effectively deferred the cost of copying from the point where a substring
is created until it must be converted to `String` for use with an API.

A user who needs to optimize away copies altogether should use this guideline:
if for performance reasons you are tempted to add a `Range` argument to your
method as well as a `String` to avoid unnecessary copies, you should instead
use `Substring`.

##### The “Empty Subscript”

To make it easy to call such an optimized API when you only have a `String` (or
to call any API that takes a `Collection`'s `SubSequence` when all you have is
the `Collection`), we propose the following “empty subscript” operation,

    extension Collection {
      subscript() -> SubSequence {
        return self[startIndex..<endIndex]
      }
    }

which allows the following usage:

    funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring

The `[]` syntax can be offered as a fixit when needed, similar to `&` for an
`inout` argument. While it doesn't help a user to convert `[String]` to
`[Substring]`, the need for such conversions is extremely rare, and they can
be done with a simple `map` (which could also be offered by a fixit):

    takesAnArrayOfSubstring(arrayOfString.map { $0[] })

#### Other Options Considered

As we have seen, all three options above have downsides, but it's possible
these downsides could be eliminated/mitigated by the compiler. We are proposing
one such mitigation—implicit conversion—as part of the "different type,
shared storage" option, to help avoid the cognitive load on developers of
having to deal with a separate `Substring` type.

To avoid the memory leak issues of a "same type, shared storage" substring
option, we considered whether the compiler could perform an implicit copy of
the underlying storage when it detects the string is being "stored" for long
term usage, say when it is assigned to a stored property. The trouble with this
approach is it is very difficult for the compiler to distinguish between
long-term storage versus short-term in the case of abstractions that rely on
stored properties. For example, should the storing of a substring inside an
`Optional` be considered long-term? Or the storing of multiple substrings
inside an array? The latter would not work well in the case of a
`components(separatedBy:)` implementation that intended to return an array of
substrings. It would also be difficult to distinguish intentional medium-term
storage of substrings, say by a lexer. There does not appear to be an effective
consistent rule that could be applied in the general case for detecting when a
substring is truly being stored long-term.

To avoid the cost of copying substrings under "same type, copied storage", the
optimizer could be enhanced to reduce the impact of some of those copies.
For example, this code could be optimized to pull the invariant substring out
of the loop:

    for _ in 0..<lots {
      someFunc(takingString: bigString[bigRange])
    }

It's worth noting that a similar optimization is needed to avoid an equivalent
problem with implicit conversion in the "different type, shared storage" case:

    let substring = bigString[bigRange]
    for _ in 0..<lots { someFunc(takingString: substring) }

However, in the case of "same type, copied storage" there are many use cases
that cannot be optimized as easily. Consider the following simple definition of
a recursive `contains` algorithm, which when substring slicing is linear makes
the overall algorithm quadratic:

    extension String {
      func containsChar(_ x: Character) -> Bool {
        return !isEmpty && (first == x || dropFirst().containsChar(x))
      }
    }

For the optimizer to eliminate this problem is unrealistic, forcing the user to
remember to optimize the code to not use string slicing if they want it to be
efficient (assuming they remember):

    extension String {
      // add optional argument tracking progress through the string
      func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
        let idx = idx ?? startIndex
        return idx != endIndex
          && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
      }
    }

#### Substrings, Ranges and Objective-C Interop

The pattern of passing a string/range pair is common in several Objective-C
APIs, and is made especially awkward in Swift by the non-interchangeability of
`Range<String.Index>` and `NSRange`.

    s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))

In general, however, the Swift idiom for operating on a sub-range of a
`Collection` is to *slice* the collection and operate on that:

    s2.find(s2[j..<s2.endIndex])

Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported
without the `NSRange` argument. The Objective-C importer should be changed to
give these APIs special treatment so that when a `Substring` is passed, instead
of being converted to a `String`, the full `NSString` and range are passed to
the Objective-C method, thereby avoiding a copy.

As a result, you would never need to pass an `NSRange` to these APIs, which
solves the impedance problem by eliminating the argument, resulting in more
idiomatic Swift code while retaining the performance benefit. To help users
manually handle any cases that remain, Foundation should be augmented to allow
the following syntax for converting to and from `NSRange`:

    let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
    let iToJ = Range(nsr, in: s)    // Equivalent to i..<j

### The `Unicode` protocol

With `Substring` and `String` being distinct types and sharing almost all
interface and semantics, and with the highest-performance string processing
requiring knowledge of encoding and layout that the currency types can't
provide, it becomes important to capture the common “string API” in a protocol.
Since Unicode conformance is a key feature of string processing in Swift, we
call that protocol `Unicode`:

**Note:** The following assumes several features that are planned but not yet implemented in
Swift, and should be considered a sketch rather than a final design.

    protocol Unicode
      : Comparable, BidirectionalCollection where Element == Character {

      associatedtype Encoding : UnicodeEncoding
      var encoding: Encoding { get }

      associatedtype CodeUnits
        : RandomAccessCollection where Element == Encoding.CodeUnit
      var codeUnits: CodeUnits { get }

      associatedtype UnicodeScalars
        : BidirectionalCollection where Element == UnicodeScalar
      var unicodeScalars: UnicodeScalars { get }

      associatedtype ExtendedASCII
        : BidirectionalCollection where Element == UInt32
      var extendedASCII: ExtendedASCII { get }
    }

    extension Unicode {
      // ... define high-level non-mutating string operations, e.g. search ...

      func compared<Other: Unicode>(
        to rhs: Other,
        case caseSensitivity: StringSensitivity? = nil,
        diacritic diacriticSensitivity: StringSensitivity? = nil,
        width widthSensitivity: StringSensitivity? = nil,
        in locale: Locale? = nil
      ) -> SortOrder { ... }
    }

    extension Unicode : RangeReplaceableCollection
      where CodeUnits : RangeReplaceableCollection {
      // Satisfy protocol requirement
      mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
        where C.Element == Element

      // ... define high-level mutating string operations, e.g. replace ...
    }

The goal is that `Unicode` exposes the underlying encoding and code units in
such a way that for types with a known representation (e.g. a high-performance
`UTF8String`) that information can be known at compile-time and can be used to
generate a single path, while still allowing types like `String` that admit
multiple representations to use runtime queries and branches to fast path
specializations.

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.

### Scanning, Matching, and Tokenization

#### Low-Level Textual Analysis

We should provide convenient APIs for processing strings by character. For example,
it should be easy to cleanly express, “if this string starts with `"f"`, process
the rest of the string as follows…” Swift is well-suited to expressing this
common pattern beautifully, but we need to add the APIs. Here are two examples
of the sort of code that might be possible given such APIs:

    if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
      somethingWith(input) // process the rest of input
    }

    if let (number, restOfInput) = input.parsingPrefix(Int.self) {
      ...
    }

The specific spelling and functionality of APIs like this are TBD. The larger
point is to make sure matching-and-consuming jobs are well-supported.

#### Unified Pattern Matcher Protocol

Many of the current methods that do matching are overloaded to do the same
logical operations in different ways, with the following axes:

- Logical Operation: `find`, `split`, `replace`, match at start
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
  the method name, and sometimes an argument
- Whole string or subrange.

We should represent these aspects as orthogonal, composable components,
abstracting pattern matchers into a protocol like
[this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
that can allow us to define logical operations once, without introducing
overloads, and massively reducing API surface area.

For example, using the strawman prefix `%` syntax to turn string literals into
patterns, the following pairs would all invoke the same generic methods:

    if let found = s.firstMatch(%"searchString") { ... }
    if let found = s.firstMatch(someRegex) { ... }

    for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
    for m in s.allMatches(someRegex) { ... }

    let items = s.split(separatedBy: ", ")
    let tokens = s.split(separatedBy: CharacterSet.whitespace)

Note that, because Swift requires the indices of a slice to match the indices of
the range from which it was sliced, operations like `firstMatch` can return a
`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in
the string being searched, if needed, can easily be recovered as the
`startIndex` and `endIndex` of the `Substring`.
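
Foundation's existing `range(of:)` can stand in for the proposed `firstMatch`
to show how a returned slice carries its own location:

```swift
import Foundation

let s = "one two three"
if let r = s.range(of: "two") {
    let match: Substring = s[r]  // a slice standing in for firstMatch's result
    // The slice's indices are valid in the original string:
    print(s.distance(from: s.startIndex, to: match.startIndex)) // 4
    print(match)                                                // "two"
}
```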

Note also that matching operations are useful for collections in general, and
would fall out of this proposal:

    // replace subsequences of contiguous NaNs with zero
    forces.replace(oneOrMore([Float.nan]), [0.0])

#### Regular Expressions

Addressing regular expressions is out of scope for this proposal.
That said, it is important to note that the pattern matching protocol mentioned
above provides a suitable foundation for regular expressions, and types such as
`NSRegularExpression` can easily be retrofitted to conform to it. In the
future, support for regular expression literals in the compiler could allow for
compile-time syntax checking and optimization.

### String Indices

`String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and
`utf16`—each with its own opaque index type. The APIs used to translate indices
between views add needless complexity, and the opacity of indices makes them
difficult to serialize.

The index translation problem has two aspects:

1. `String` views cannot consume one another's indices without a cumbersome
   conversion step. An index into a `String`'s `characters` must be translated
   before it can be used as a position in its `unicodeScalars`. Although these
   translations are rarely needed, they add conceptual and API complexity.
2. Many APIs in the core libraries and other frameworks still expose `String`
   positions as `Int`s and regions as `NSRange`s, which can only reference a
   `utf16` view and interoperate poorly with `String` itself.

#### Index Interchange Among Views

String's need for flexible backing storage and reasonably-efficient indexing
(i.e. without dynamically allocating and reference-counting the indices
themselves) means indices need an efficient underlying storage type. Although
we do not wish to expose `String`'s indices *as* integers, `Int` offsets into
underlying code unit storage make a good underlying storage type, provided
`String`'s underlying storage supports random-access. We think random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Making these `Int` code unit offsets conveniently accessible and constructible
solves the serialization problem:

clipboard.write(s.endIndex.codeUnitOffset)
let offset = clipboard.read(Int.self)
let i = String.Index(codeUnitOffset: offset)
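`String.Index(codeUnitOffset:)` does not exist yet; an approximate round trip can be sketched today by serializing a UTF-16 offset (relying on the fact that, in current Swift, `String.Index` is shared across a string's views):

```swift
let s = "café"
let i = s.index(s.startIndex, offsetBy: 3)                      // points at "é"
// "Serialize" the position as a plain Int offset into the utf16 view…
let offset = s.utf16.distance(from: s.utf16.startIndex, to: i)
assert(offset == 3)
// …and rebuild an index from that offset later.
let restored = s.utf16.index(s.utf16.startIndex, offsetBy: offset)
assert(s[restored] == "é")
```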

Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
seamless by having them share an index type (semantics of indexing a `String`
between grapheme cluster boundaries are TBD—it can either trap or be forgiving).
Having a common index allows easy traversal into the interior of graphemes,
something that is often needed, without making it likely that someone will do it
by accident.

- `String.index(after:)` should advance to the next grapheme, even when the
  index points partway through a grapheme.

- `String.index(before:)` should move to the start of the grapheme before
  the current position.
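Whole-grapheme advancement is already observable with today's API; a multi-scalar flag is skipped as a unit:

```swift
let flags = "🇺🇸!"          // two regional-indicator scalars form one Character
assert(flags.count == 2)
assert(flags.unicodeScalars.count == 3)

// index(after:) advances past the entire flag grapheme in one step.
let second = flags.index(after: flags.startIndex)
assert(flags[second] == "!")
```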

Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not
crucial, as the specifics of encoding should not be a concern for most use
cases, and would impose needless costs on the indices of other views. That
said, we can make translation much more straightforward by exposing simple
bidirectional converting `init`s on both index types:

let u8Position = String.UTF8.Index(someStringIndex)
let originalPosition = String.Index(u8Position)
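The proposed converting `init`s don't exist yet, but today's `samePosition(in:)` performs the same translation:

```swift
let s = "héllo"
let i = s.firstIndex(of: "é")!
// String.Index → position in the UTF-8 view, like the proposed
// String.UTF8.Index(someStringIndex).
let u8 = i.samePosition(in: s.utf8)!
assert(s.utf8[u8] == 0xC3)   // first UTF-8 byte of "é" (U+00E9 encodes as C3 A9)
```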

#### Index Interchange with Cocoa

We intend to address `NSRange`s that denote substrings in Cocoa APIs as
described [later in this document](#substrings--ranges-and-objective-c-interop).
That leaves the interchange of bare indices with Cocoa APIs trafficking in
`Int`. Hopefully such APIs will be rare, but when needed, the following
extension, which would be useful for all `Collections`, can help:

extension Collection {
  func index(offset: IndexDistance) -> Index {
    return index(startIndex, offsetBy: offset)
  }
  func offset(of i: Index) -> IndexDistance {
    return distance(from: startIndex, to: i)
  }
}

Then integers can easily be translated into offsets into a `String`'s `utf16`
view for consumption by Cocoa:

let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
let swiftIndex = s.utf16.index(offset: cocoaIndex)
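For `NSRange`s specifically, Foundation already ships the corresponding conversions, so a round trip needs no helper extension:

```swift
import Foundation

let s = "café ☕️"
let r = s.range(of: "☕️")!
let cocoaRange = NSRange(r, in: s)       // UTF-16 based, ready for Cocoa APIs
assert(Range(cocoaRange, in: s) == r)    // and back, losslessly
```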

### Formatting

A full treatment of formatting is out of scope of this proposal, but
we believe it's crucial for completing the text processing picture. This
section details some of the existing issues and thinking that may guide future
development.

#### Printf-Style Formatting

`String.format` is designed on the `printf` model: it takes a format string with
textual placeholders for substitution, and an arbitrary list of other arguments.
The syntax and meaning of these placeholders have a long history in
C, but for anyone who doesn't use them regularly they are cryptic and complex,
as the `printf(3)` man page attests.

Aside from complexity, this style of API has two major problems: First, the
spelling of these placeholders must match up to the types of the arguments, in
the right order, or the behavior is undefined. Some limited support for
compile-time checking of this correspondence could be implemented, but only for
the cases where the format string is a literal. Second, there's no reasonable
way to extend the formatting vocabulary to cover the needs of new types: you are
stuck with what's in the box.
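The coupling hazard is easy to see with Foundation's `String(format:)`: the placeholder and the argument type are matched only by convention, whereas interpolation ties each value to its position in the text:

```swift
import Foundation

let n: Int32 = 255
// printf-style: "%08X" must be kept in sync with the argument's type by hand.
let hex = String(format: "%08X", n)
assert(hex == "000000FF")
// Interpolation needs no placeholder bookkeeping.
let safe = "n = \(n)"
assert(safe == "n = 255")
```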

#### Foundation Formatters

The formatters supplied by Foundation are highly capable and versatile, offering
both formatting and parsing services. When used for formatting, though, the
design pattern demands more from users than it should:

* Matching the type of data being formatted to a formatter type
* Creating an instance of that type
* Setting stateful options (`currency`, `dateStyle`) on the type. Note: the
   need for this step prevents the instance from being used and discarded in
   the same expression where it is created.
* Overall, introduction of needless verbosity into source

These may seem like small issues, but the experience of Apple localization
experts is that the total drag of these factors on programmers is such that they
tend to reach for `String.format` instead.
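Spelled out with `NumberFormatter`, the multi-step pattern looks like this:

```swift
import Foundation

// 1. Match the data type (a number) to a formatter type.
let formatter = NumberFormatter()
// 2–3. Create an instance, then set stateful options on it; the options
// cannot be supplied in the same expression that creates the formatter.
formatter.numberStyle = .currency
formatter.locale = Locale(identifier: "en_US")
let price = formatter.string(from: 4.99)!
assert(price == "$4.99")
```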

#### String Interpolation

Swift string interpolation provides a user-friendly alternative to printf's
domain-specific language (just write ordinary swift code!) and its type safety
problems (put the data right where it belongs!) but the following issues prevent
it from being useful for localized formatting (among other jobs):

* [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
   types used in string interpolation.
* [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
   distinguish (fragments of) the base string from the string substitutions.

In the long run, we should improve Swift string interpolation to the point where
it can participate in most any formatting job. Mostly this centers around
fixing the interpolation protocols per the previous item, and supporting
localization.

To be able to use formatting effectively inside interpolations, it needs to be
both lightweight (because it all happens in-situ) and discoverable. One
approach would be to standardize on `format` methods, e.g.:

"Column 1: \(n.format(radix:16, width:8)) *** \(message)"

"Something with leading zeroes: \(x.format(fill: zero, width:8))"
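These `format` methods are hypothetical; a minimal sketch of one, built on today's standard library (the parameter names are illustrative, not proposed API):

```swift
extension FixedWidthInteger {
    // Illustrative only: right-align the digits in `width` columns,
    // padding with `fill`.
    func format(radix: Int = 10, width: Int = 0, fill: Character = " ") -> String {
        let digits = String(self, radix: radix)
        let padding = max(0, width - digits.count)
        return String(repeating: String(fill), count: padding) + digits
    }
}

let n = 0xCAFE
assert(n.format(radix: 16, width: 8) == "    cafe")
assert(n.format(radix: 16, width: 8, fill: "0") == "0000cafe")
```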

Another thing that might limit adoption is the verbosity of this format. It works fine if I need to print one or two things, but it gets unwieldy very quickly.

### C String Interop

Our support for interoperation with nul-terminated C strings is scattered and
incoherent, with six ways to transform a C string into a `String` and four ways
to do the inverse. These APIs should be replaced with the following:

extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  ///   bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  ///   the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}

In both of the construction APIs, any invalid encoding sequence detected will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per Unicode specification. This covers the common case. The
replacement is done *physically* in the underlying storage and the validity of
the result is recorded in the `String`'s `encoding`, so that future accesses
need not be slowed down by separate error repair.
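Today's `String(cString:)` already performs this kind of U+FFFD repair on construction, though without recording validity for later accesses:

```swift
// "hi" followed by an invalid UTF-8 byte (0xFF) and the terminating NUL.
let bytes: [CChar] = [0x68, 0x69, -1, 0]
let repaired = String(cString: bytes)
assert(repaired == "hi\u{FFFD}")   // the invalid byte became U+FFFD
```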

Construction that is aborted when encoding errors are detected can be
accomplished using APIs on the `encoding`. String types that retain their
physical encoding even in the presence of errors and are repaired on-the-fly can
be built as different instances of the `Unicode` protocol.

### Unicode 9 Conformance

Unicode 9 (and macOS 10.11) brought us support for family emoji, which changes
the process of properly identifying `Character` boundaries. We need to update
`String` to account for this change.
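The change is visible in grapheme counting: a ZWJ family sequence should be a single `Character`, as it is in current Swift:

```swift
let family = "👨‍👩‍👧‍👦"    // four person scalars joined by three U+200D ZWJs
assert(family.count == 1)                  // one grapheme under Unicode 9 rules
assert(family.unicodeScalars.count == 7)   // 4 people + 3 joiners
```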

### High-Performance String Processing

Many strings are short enough to store in 64 bits, many can be stored using only
8 bits per unicode scalar, others are best encoded in UTF-16, and some come to
us already in some other encoding, such as UTF-8, that would be costly to
translate. Supporting these formats while maintaining usability for
general-purpose APIs demands that a single `String` type can be backed by many
different representations.

That said, the highest performance code always requires static knowledge of the
data structures on which it operates, and for this code, dynamic selection of
representation comes at too high a cost. Heavy-duty text processing demands a
way to opt out of dynamism and directly use known encodings. Having this
ability can also make it easy to cleanly specialize code that handles dynamic
cases for maximal efficiency on the most common representations.

To address this need, we can build models of the `Unicode` protocol that encode
representation information into the type, such as `NFCNormalizedUTF16String`.

### Parsing ASCII Structure

Although many machine-readable formats support the inclusion of arbitrary
Unicode text, it is also common that their fundamental structure lies entirely
within the ASCII subset (JSON, YAML, many XML formats). These formats are often
processed most efficiently by recognizing ASCII structural elements as ASCII,
and capturing the arbitrary sections between them in more-general strings. The
current String API offers no way to efficiently recognize ASCII and skip past
everything else without the overhead of full decoding into unicode scalars.

For these purposes, strings should supply an `extendedASCII` view that is a
collection of `UInt32`, where values less than `0x80` represent the
corresponding ASCII character, and other values represent data that is specific
to the underlying encoding of the string.
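A rough approximation of that pattern is possible today with the `utf8` view, treating bytes below `0x80` as structure and skipping everything else (a sketch, not a real JSON parser, since it ignores brackets inside string literals):

```swift
let doc = "{\"k\": \"vælue\", \"n\": [1, 2]}"
var depth = 0, maxDepth = 0
for byte in doc.utf8 where byte < 0x80 {   // consider ASCII bytes only
    if byte == UInt8(ascii: "{") || byte == UInt8(ascii: "[") { depth += 1 }
    if byte == UInt8(ascii: "}") || byte == UInt8(ascii: "]") { depth -= 1 }
    maxDepth = max(maxDepth, depth)
}
assert(depth == 0 && maxDepth == 2)   // balanced, nested two levels deep
```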

There are some things that are known to lie entirely within ASCII. Are there any plans to add a way to work with them in a simple manner (subscripting, looping, etc.), possibly through the use of an `Array<ASCIIChar>?` property or whatever?

···

On Jan 19, 2017, at 6:56 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

## Language Support

This proposal depends on two new features in the Swift language:

1. **Generic subscripts**, to
  enable unified slicing syntax.

2. **A subtype relationship** between
  `Substring` and `String`, enabling framework APIs to traffic solely in
  `String` while still making it possible to avoid copies by handling
  `Substring`s where necessary.

Additionally, **the ability to nest types and protocols inside
protocols** could significantly shrink the footprint of this proposal
on the top-level Swift namespace.

## Open Questions

### Must `String` be limited to storing UTF-16 subset encodings?

- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
question here; this is about what encodings must be storable, without
transcoding, in the common currency type called “`String`”.
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
- If we have a way to get at a `String`'s code units, we need a concrete type in
which to express them in the API of `String`, which is a concrete type
- If String needs to be able to represent UTF-32, presumably the code units need
to be `UInt32`.
- Not supporting UTF-32-encoded text seems like one reasonable design choice.
- Maybe we can allow UTF-8 storage in `String` and expose its code units as
`UInt16`, just as we would for Latin-1.
- Supporting only UTF-16-subset encodings would imply that `String` indices can
be serialized without recording the `String`'s underlying encoding.

### Do we need a type-erasable base protocol for UnicodeEncoding?

UnicodeEncoding has an associated type, but it may be important to be able to
traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”

### Should there be a string “facade?”

One possible design alternative makes `Unicode` a vehicle for expressing
the storage and encoding of code units, but does not attempt to give it an API
appropriate for `String`. Instead, string APIs would be provided by a generic
wrapper around an instance of `Unicode`:

struct StringFacade<U: Unicode> : BidirectionalCollection {

 // ...APIs for high-level string processing here...

 var unicode: U // access to lower-level unicode details
}

typealias String = StringFacade<StringStorage>
typealias Substring = StringFacade<StringStorage.SubSequence>

This design would allow us to de-emphasize lower-level `String` APIs such as
access to the specific encoding, by putting them behind a `.unicode` property.
A similar effect in a facade-less design would require a new top-level
`StringProtocol` playing the role of the facade with an `associatedtype
Storage : Unicode`.

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

struct String<U: Unicode = StringStorage> 
 : BidirectionalCollection {

 // ...APIs for high-level string processing here...

 var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>

One advantage of such a design is that naïve users will always extend “the right
type” (`String`) without thinking, and the new APIs will show up on `Substring`,
`MyUTF8String`, etc. That said, it also has downsides that should not be
overlooked, not least of which is the confusability of the meaning of the word
“string.” Is it referring to the generic or the concrete type?

### `TextOutputStream` and `TextOutputStreamable`

`TextOutputStreamable` is intended to provide a vehicle for
efficiently transporting formatted representations to an output stream
without forcing the allocation of storage. Its use of `String`, a
type with multiple representations, at the lowest-level unit of
communication, conflicts with this goal. It might be sufficient to
change `TextOutputStream` and `TextOutputStreamable` to traffic in an
associated type conforming to `Unicode`, but that is not yet clear.
This area will require some design work.

### `description` and `debugDescription`

* Should these be creating localized or non-localized representations?
* Is returning a `String` efficient enough?
* Is `debugDescription` pulling the weight of the API surface area it adds?

### `StaticString`

`StaticString` was added as a byproduct of standard library development and kept
around because it seemed useful, but it was never truly *designed* for client
programmers. We need to decide what happens with it. Presumably *something*
should fill its role, and that should conform to `Unicode`.

## Footnotes

<b id="f0">0</b> The integers rewrite currently underway is expected to
   substantially reduce the scope of `Int`'s API by using more
   generics. [:leftwards_arrow_with_hook:](#a0)

<b id="f1">1</b> In practice, these semantics will usually be tied to the
version of the installed [ICU](http://icu-project.org) library, which
programmatically encodes the most complex rules of the Unicode Standard and its
de-facto extension, CLDR.[:leftwards_arrow_with_hook:](#a1)

<b id="f2">2</b>
See
[http://unicode.org/reports/tr29/#Notation](http://unicode.org/reports/tr29/#Notation). Note
that inserting Unicode scalar values to prevent merging of grapheme clusters would
also constitute a kind of misbehavior (one of the clusters at the boundary would
not be found in the result), so would be relatively costly to implement, with
little benefit. [:leftwards_arrow_with_hook:](#a2)

<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by
the Unicode standard for this purpose. In fact there's
a [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
dedicated to it. In particular, §5.17 says:

When comparing text that is visible to end users, a correct linguistic sort
should be used, as described in _Section 5.16, Sorting and
Searching_. However, in many circumstances the only requirement is for a
fast, well-defined ordering. In such cases, a binary ordering can be used.

[:leftwards_arrow_with_hook:](#a4)

<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto
properties in a table that's indexed by unicode scalar value. This table is
part of the Unicode standard. Some of these queries (e.g., “is this an
uppercase character?”) may have fairly obvious generalizations to grapheme
clusters, but exactly how to do it is a research topic and *ideally* we'd either
establish the existing practice that the Unicode committee would standardize, or
the Unicode committee would do the research and we'd implement their
result.[:leftwards_arrow_with_hook:](#a5)



(David Sweeris) #3

Regarding substrings... Instead of having separate `ArraySlice` and `Substring` types, what about having just one type, `Slice<T: Sequence>`, for anything which shares memory? Seems like it'd be easier for users who'd only have to worry about shared storage for one type, and for stdlib authors who'd only have to write it once.

Of course, that assumes it actually would be easier to only have to write/maintain one such "shared memory" type... If that's not the case, I'm not sure the smaller API surface is worth it.

Anyway, I haven't finished reading through it all yet, but I like what I've seen so far.

- Dave Sweeris

···

On Jan 19, 2017, at 20:56, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

Hi all,

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

We’re eager to hear everyone’s thoughts.

Regards,
Ben and Dave


(David Sweeris) #4


An enthusiastic +1

A couple more quick thoughts...

1) Is it just me, or is explicitly putting some of the "higher level" functionality in Foundation instead of stdlib kinda reminiscent of MVC? I guess UIKit/Cocoa would be the "View" part.

2) I like the idea of making String generic over its encoding... Would we need to nail down the hypothetical type promotion system for that to work, or can it all be handled internally?

- Dave Sweeris

···

Sent from my iPhone

On Jan 19, 2017, at 20:56, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:


(Xiaodi Wu) #5

Clearly too big to digest in one take. Some initial thoughts:

* Not sure about the wisdom of the ad-hoc Substring : String compiler
magic. It seems that whatever needs overcoming here would be equally
relevant for ArraySlice. It would be more design work, but perhaps not
terribly more implementation work, to have a magical protocol that allows
the compiler to apply similar magic to conforming types (e.g. an
`ImplicitlyConvertibleSlice` protocol with an associated type, to which
ArraySlice and String could both conform). Alternatively, perhaps all of
this is not truly necessary for sufficient ergonomics.

* A requirement to transcode UTF-8 strings to UTF-16 for storage
seems...inefficient? Why any hesitation at all to expose UTF-8-encoded code
units as UInt16? Sure, there are going to be unused bits, but so what? If I
understand it correctly, it's only the concrete type exposed on String for
code units that's in play here; the backing representations themselves can
use whatever is most efficient. So, why _not_ support UTF-32 and expose all
code units as UInt32? Isn't that exactly paralleling the design for the
extendedASCII view, where users get ASCII characters back as UInt32 and
encoding-specific code units as such as well?

* Are the backing representations for String also the same types that can
be exposed statically (as in the mentioned `NFCNormalizedUTF16String`)?

* Why `withCString` with a closure instead of just `cString` returning
[CChar]? Particularly if the backing store isn't UTF8, isn't the C string
going to have to be a newly allocated buffer anyway? Personally, I find the
current `utf8CString` to be quite convenient :stuck_out_tongue:

···

On Thu, Jan 19, 2017 at 8:56 PM, Ben Cohen via swift-evolution < swift-evolution@swift.org> wrote:

Hi all,

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

We’re eager to hear everyone’s thoughts.

Regards,
Ben and Dave

# String Processing For Swift 4

* Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](
https://github.com/airspeedswift)

The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined
thus
far, with just this short blurb in the
[list of goals](https://lists.swift.org/pipermail/swift-evolution/
Week-of-Mon-20160725/025676.html):

> **String re-evaluation**: String is one of the most important fundamental
> types in the language. The standard library leads have numerous ideas
of how
> to improve the programming model for it, without jeopardizing the goals
of
> providing a unicode-correct-by-default model. Our goal is to be better
at
> string processing than Perl!

For Swift 4 and beyond we want to improve three dimensions of text
processing:

  1. Ergonomics
  2. Correctness
  3. Performance

This document is meant to both provide a sense of the long-term vision
(including undecided issues and possible approaches), and to define the
scope of
work that could be done in the Swift 4 timeframe.

## General Principles

### Ergonomics

It's worth noting that ergonomics and correctness are
mutually-reinforcing. An
API that is easy to use—but incorrectly—cannot be considered an ergonomic
success. Conversely, an API that's simply hard to use is also hard to use
correctly. Acheiving optimal performance without compromising ergonomics
or
correctness is a greater challenge.

Consistency with the Swift language and idioms is also important for
ergonomics. There are several places both in the standard library and in
the
foundation additions to `String` where patterns and practices found
elsewhere
could be applied to improve usability and familiarity.

### API Surface Area

Primary data types such as `String` should have APIs that are easily
understood
given a signature and a one-line summary. Today, `String` fails that
test. As
you can see, the Standard Library and Foundation both contribute
significantly to
its overall complexity.

**Method Arity** | **Standard Library** | **Foundation**
---|:---:|:---:
0: `ƒ()` | 5 | 7
1: `ƒ(:)` | 19 | 48
2: `ƒ(::)` | 13 | 19
3: `ƒ(:::)` | 5 | 11
4: `ƒ(::::)` | 1 | 7
5: `ƒ(:::::)` | - | 2
6: `ƒ(::::::)` | - | 1

**API Kind** | **Standard Library** | **Foundation**
---|:---:|:---:
`init` | 41 | 18
`func` | 42 | 55
`subscript` | 9 | 0
`var` | 26 | 14

**Total: 205 APIs**

By contrast, `Int` has 80 APIs, none with more than two parameters.[0]
String processing is complex enough; users shouldn't have
to press through physical API sprawl just to get started.

Many of the choices detailed below contribute to solving this problem,
including:

  * Restoring `Collection` conformance and dropping the `.characters` view.
  * Providing a more general, composable slicing syntax.
  * Altering `Comparable` so that parameterized
    (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
  * Clearly separating language-dependent operations on text produced
    by and for humans from language-independent
    operations on text produced by and for machine processing.
  * Relocating APIs that fall outside the domain of basic string
processing and
    discouraging the proliferation of ad-hoc extensions.

### Batteries Included

While `String` is available to all programs out-of-the-box, crucial APIs
for
basic string processing tasks are still inaccessible until `Foundation` is
imported. While it makes sense that `Foundation` is needed for
domain-specific
jobs such as
[linguistic tagging](https://developer.apple.com/reference/
foundation/nslinguistictagger),
one should not need to import anything to, for example, do case-insensitive
comparison.

### Unicode Compliance and Platform Support

The Unicode standard provides a crucial objective reference point for what
constitutes correct behavior in an extremely complex domain, so
Unicode-correctness is, and will remain, a fundamental design principle
behind
Swift's `String`. That said, the Unicode standard is an evolving
document, so
this objective reference-point is not fixed.[1] While
many of the most important operations—e.g. string hashing, equality, and
non-localized comparison—will be stable, the semantics
of others, such as grapheme breaking and localized comparison and case
conversion, are expected to change as platforms are updated, so programs
should
be written so their correctness does not depend on precise stability of
these
semantics across OS versions or platforms. Although it may be possible to
imagine static and/or dynamic analysis tools that will help users find such
errors, the only sure way to deal with this fact of life is to educate
users.

## Design Points

### Internationalization

There is strong evidence that developers cannot determine how to use
internationalization APIs correctly. Although documentation could and
should be
improved, the sheer size, complexity, and diversity of these APIs is a
major
contributor to the problem, causing novices to tune out, and more
experienced
programmers to make avoidable mistakes.

The first step in improving this situation is to regularize all localized
operations as invocations of normal string operations with extra
parameters. Among other things, this means:

1. Doing away with `localizedXXX` methods
2. Providing a terse way to name the current locale as a parameter
3. Automatically adjusting defaults for options such
   as case sensitivity based on whether the operation is localized.
4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
    guidance in the
    [Internationalization and Localization Guide](https://developer.
apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/
InternationalizingYourCode/InternationalizingYourCode.html).

Along with appropriate documentation updates, these changes will make
localized
operations more teachable, comprehensible, and approachable, thereby
lowering a
barrier that currently leads some developers to ignore localization issues
altogether.

#### The Default Behavior of `String`

Although this isn't well-known, the most accessible form of many
operations on
Swift `String` (and `NSString`) are really only appropriate for text that
is
intended to be processed for, and consumed by, machines. The semantics of
the
operations with the simplest spellings are always non-localized and
language-agnostic.

Two major factors play into this design choice:

1. Machine processing of text is important, so we should have first-class,
   accessible functions appropriate to that use case.

2. The most general localized operations require a locale parameter not
required
   by their un-localized counterparts. This naturally skews complexity
towards
   localized operations.

Reaffirming that `String`'s simplest APIs have
language-independent/machine-processed semantics has the benefit of
clarifying
the proper default behavior of operations such as comparison, and allows
us to
make [significant optimizations](#collation-semantics) that were
previously
thought to conflict with Unicode.

#### Future Directions

One of the most common internationalization errors is the unintentional
presentation to users of text that has not been localized, but
regularizing APIs
and improving documentation can go only so far in preventing this error.
Combined with the fact that `String` operations are non-localized by
default,
the environment for processing human-readable text may still be somewhat
error-prone in Swift 4.

For an audience of mostly non-experts, it is especially important that
naïve
code is very likely to be correct if it compiles, and that more
sophisticated
issues can be revealed progressively. For this reason, we intend to
specifically and separately target localization and internationalization
problems in the Swift 5 timeframe.

### Operations With Options

There are three categories of common string operation that commonly need
to be
tuned in various dimensions:

**Operation**|**Applicable Options**
---|---
sort ordering | locale, case/diacritic/width-insensitivity
case conversion | locale
pattern matching | locale, case/diacritic/width-insensitivity

The defaults for case-, diacritic-, and width-insensitivity are different
for
localized operations than for non-localized operations, so for example a
localized sort should be case-insensitive by default, and a non-localized
sort
should be case-sensitive by default. We propose a standard “language” of
defaulted parameters to be used for these purposes, with usage roughly
like this:

  x.compared(to: y, case: .sensitive, in: swissGerman)

  x.lowercased(in: .currentLocale)

  x.allMatches(
    somePattern, case: .insensitive, diacritic: .insensitive)

This usage might be supported by code like this:

enum StringSensitivity {
case sensitive
case insensitive
}

extension Locale {
  static var currentLocale: Locale { ... }
}

extension Unicode {
  // An example of the option language in declaration context,
  // with nil defaults indicating unspecified, so defaults can be
  // driven by the presence/absence of a specific Locale
  func frobnicated(
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> Self { ... }
}

### Comparing and Hashing Strings

#### Collation Semantics

What Unicode says about collation—which is used in `<`, `==`, and hashing—
turns
out to be quite interesting, once you pick it apart. The full Unicode
Collation
Algorithm (UCA) works like this:

1. Fully normalize both strings
2. Convert each string to a sequence of numeric triples to form a
collation key
3. “Flatten” the key by concatenating the sequence of first elements to the
   sequence of second elements to the sequence of third elements
4. Lexicographically compare the flattened keys

While step 1 can usually
be [done quickly](http://unicode.org/reports/tr15/#Description_Norm) and
incrementally, step 2 uses a collation table that maps matching
*sequences* of
unicode scalars in the normalized string to *sequences* of triples, which
get
accumulated into a collation key. Predictably, this is where the real
costs
lie.

*However*, there are some bright spots to this story. First, as it turns
out,
string sorting (localized or not) should be done down to what's called
the
[“identical” level](http://unicode.org/reports/tr10/#Multi_Level_
Comparison),
which adds a step 3a: append the string's normalized form to the flattened
collation key. At first blush this just adds work, but consider what it
does
for equality: two strings that normalize the same, naturally, will collate
the
same. But also, *strings that normalize differently will always collate
differently*. In other words, for equality, it is sufficient to compare
the
strings' normalized forms and see if they are the same. We can therefore
entirely skip the expensive part of collation for equality comparison.

Next, naturally, anything that applies to equality also applies to
hashing: it
is sufficient to hash the string's normalized form, bypassing collation
keys.
This should provide significant speedups over the current implementation.
Perhaps more importantly, since comparison down to the “identical” level
applies
even to localized strings, it means that hashing and equality can be
implemented
exactly the same way for localized and non-localized text, and hash tables
with
localized keys will remain valid across current-locale changes.

Finally, once it is agreed that the *default* role for `String` is to
handle
machine-generated and machine-readable text, the default ordering of
`String`s
need no longer use the UCA at all. It is sufficient to order them in any
way
that's consistent with equality, so `String` ordering can simply be a
lexicographical comparison of normalized forms,[4]
(which is equivalent to lexicographically comparing the sequences of
grapheme
clusters), again bypassing step 2 and offering another speedup.

This leaves us executing the full UCA *only* for localized sorting, and
ICU's
implementation has apparently been very well optimized.

Following this scheme everywhere would also allow us to make sorting
behavior
consistent across platforms. Currently, we sort `String` according to the
UCA,
except that—*only on Apple platforms*—pairs of ASCII characters are
ordered by
unicode scalar value.

#### Syntax

Because the current `Comparable` protocol expresses all comparisons with
binary
operators, string comparisons—which may require
additional [options](#operations-with-options)—do not fit smoothly into
the
existing syntax. At the same time, we'd like to solve other problems with
comparison, as outlined
in
[this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
(implemented by changes at the head
of
[this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
We should adopt a modification of that proposal that uses a method rather
than
an operator `<=>`:

```swift
enum SortOrder { case before, same, after }

protocol Comparable : Equatable {
  func compared(to: Self) -> SortOrder
  ...
}
```

This change will give us a syntactic platform on which to implement
methods with
additional, defaulted arguments, thereby unifying and regularizing
comparison
across the library.

```swift
extension String {
  func compared(to: Self) -> SortOrder
}
```

**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also
possible
that the standard library simply adopts Foundation's `ComparisonResult` as
is,
but we believe the community should at least consider alternate naming
before
that happens. There will be an opportunity to discuss the choices in
detail
when the modified
[Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes
up for review.
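The proposed method can be prototyped today on top of the existing `Comparable` operators; the following is a minimal sketch, not the final design:

```swift
enum SortOrder: Equatable { case before, same, after }

extension Comparable {
    // Straw implementation in terms of today's `<`; the real proposal would
    // invert the dependency so `<` is derived from `compared(to:)`.
    func compared(to other: Self) -> SortOrder {
        if self < other { return .before }
        if other < self { return .after }
        return .same
    }
}

assert("apple".compared(to: "banana") == .before)
assert(42.compared(to: 42) == .same)
```

This shape is what makes defaulted option arguments possible: a method can grow `case:`, `diacritic:`, and `locale:` parameters, while an operator like `<=>` cannot.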

### `String` should be a `Collection` of `Character`s Again

In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
convinced ourselves that its semantics differed from those of `Collection`
too
significantly.

It was always well understood that if strings were treated as sequences of
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`,
`elementsEqual`,
and `reversed` would produce nonsense results. Thus, in Swift 1.0,
`String` was
a collection of `Character` (extended grapheme clusters). During 2.0
development, though, we realized that correct string concatenation could
occasionally merge distinct grapheme clusters at the start and end of
combined
strings.

This quirk aside, every aspect of strings-as-collections-of-graphemes
appears to
comport perfectly with Unicode. We think the concatenation problem is
tolerable,
because the cases where it occurs all represent partially-formed
constructs. The
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING
ACUTE
ACCENT)—are explicitly called out in the Unicode standard as
“[degenerate](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)” or
“[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The
other
cases—such as a string ending in a zero-width joiner or half of a regional
indicator—appear to be equally transient and unlikely outside of a text
editor.

Admitting these cases encourages exploration of grapheme composition and is
consistent with what appears to be an overall Unicode philosophy that “no
special provisions are made to get marginally better behavior for… cases
that
never occur in practice.”[2] Furthermore, it seems
unlikely to disturb the semantics of any plausible algorithms. We can
handle
these cases by documenting them, explicitly stating that the elements of a
`String` are an emergent property based on Unicode rules.

The benefits of restoring `Collection` conformance are substantial:

  * Collection-like operations encourage experimentation with strings to
    investigate and understand their behavior. This is useful for teaching
new
    programmers, but also good for experienced programmers who want to
    understand more about strings/unicode.

  * Extended grapheme clusters form a natural element boundary for Unicode
    strings. For example, searching and matching operations will always
produce
    results that line up on grapheme cluster boundaries.

  * Character-by-character processing is a legitimate thing to do in many
real
    use-cases, including parsing, pattern matching, and language-specific
    transformations such as transliteration.

  * `Collection` conformance makes a wide variety of powerful operations
    available that are appropriate to `String`'s default role as the
vehicle for
    machine processed text.

    Where the methods `String` would inherit from `Collection` parallel
    higher-level string algorithms, they have the right semantics. For example,
    grapheme-wise `lexicographicalCompare`, `elementsEqual`, and
application of
    `flatMap` with case-conversion, produce the same results one would
expect
    from whole-string ordering comparison, equality comparison, and
    case-conversion, respectively. `reverse` operates correctly on
graphemes,
    keeping diacritics moored to their base characters and leaving emoji
intact.
    Other methods such as `indexOf` and `contains` make obvious sense. A
few
    `Collection` methods, like `min` and `max`, may not be particularly
useful
    on `String`, but we don't consider that to be a problem worth solving,
in
    the same way that we wouldn't try to suppress `min` and `max` on a
    `Set([UInt8])` that was used to store IP addresses.

  * Many of the higher-level operations that we want to provide for
`String`s,
    such as parsing and pattern matching, should apply to any
`Collection`, and
    many of the benefits we want for `Collections`, such
    as unified slicing, should accrue
    equally to `String`. Making `String` part of the same protocol
hierarchy
    allows us to write these operations once and not worry about keeping
the
    benefits in sync.

  * Slicing strings into substrings is a crucial part of the vocabulary of
    string processing, and all other sliceable things are `Collection`s.
    Because of its collection-like behavior, users naturally think of
`String`
    in collection terms, but run into frustrating limitations where it
fails to
    conform and are left to wonder where all the differences lie. Many
simply
    “correct” this limitation by declaring a trivial conformance:

    ```swift
  extension String : BidirectionalCollection {}
    ```

    Even if we removed indexing-by-element from `String`, users could
still do
    this:

    ```swift
      extension String : BidirectionalCollection {
        subscript(i: Index) -> Character { return characters[i] }
      }
    ```

    It would be much better to legitimize the conformance to `Collection`
and
    simply document the oddity of any concatenation corner-cases, than to
deny
    users the benefits on the grounds that a few cases are confusing.

Note that the fact that `String` is a collection of graphemes does *not*
mean
that string operations will necessarily have to do grapheme boundary
recognition. See the Unicode protocol section for details.
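These benefits are easy to demonstrate with the grapheme-cluster behavior that `Collection` conformance exposes (the conformance was in fact restored in Swift 4):

```swift
let s = "cafe\u{301}"                          // "café" with a combining accent
assert(s.count == 4)                           // counts graphemes, not scalars
assert(String(s.reversed()) == "e\u{301}fac")  // the accent stays on its base

// The concatenation corner case described above: a defective leading
// combining mark merges into the final grapheme of the preceding string.
let lhs = "e", rhs = "\u{301}"
assert(lhs.count + rhs.count == 2)
assert((lhs + rhs).count == 1)
```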

### `Character` and `CharacterSet`

`Character`, which represents a
Unicode
[extended grapheme cluster](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries),
is a bit of a black box, requiring conversion to `String` in order to
do any introspection, including interoperation with ASCII. To fix this,
we should:

- Add a `unicodeScalars` view much like `String`'s, so that the
sub-structure
   of grapheme clusters is discoverable.
- Add a failable `init` from sequences of scalars (returning nil for
sequences
   that contain 0 or 2+ graphemes).
- (Lower priority) expose some operations, such as `func uppercase() ->
   String`, `var isASCII: Bool`, and, to the extent they can be sensibly
   generalized, queries of unicode properties that should also be exposed on
   `UnicodeScalar`, such as `isAlphabetic` and `isGraphemeBase`.
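Much of this list has since landed in shipping Swift: `Character` gained a `unicodeScalars` view and, later, property queries such as `isASCII` (SE-0221). A brief example of the resulting introspection:

```swift
let flag: Character = "🇺🇸"
assert(flag.unicodeScalars.count == 2)   // two regional-indicator scalars
assert(!flag.isASCII)

let a: Character = "a"
assert(a.isASCII)
assert(String(a).uppercased() == "A")    // case conversion still goes via String
```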

Despite its name, `CharacterSet` currently operates on the Swift
`UnicodeScalar`
type. This means it is usable on `String`, but only by going through the
unicode
scalar view. To deal with this clash in the short term, `CharacterSet`
should be
renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate
to
introduce a `CharacterSet` that provides similar functionality for extended
grapheme clusters.[5]

### Unification of Slicing Operations

Creating substrings is a basic part of String processing, but the slicing
operations that we have in Swift are inconsistent in both their spelling
and
their naming:

  * Slices with two explicit endpoints are done with subscript, and support
    in-place mutation:

    ```swift
        s[i..<j].mutate()
    ```

  * Slicing from an index to the end, or from the start to an index, is
done
    with a method and does not support in-place mutation:
    ```swift
        s.prefix(upTo: i).readOnly()
    ```

Prefix and suffix operations should be migrated to be subscripting
operations
with one-sided ranges i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as
in
[this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
With generic subscripting in the language, that will allow us to collapse
a wide
variety of methods and subscript overloads into a single implementation,
and
give users an easy-to-use and composable way to describe subranges.

Further extending this EDSL to integrate use-cases like
`s.prefix(maxLength: 5)`
is an ongoing research project that can be considered part of the potential
long-term vision of text (and collection) processing.
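One-sided range subscripting landed in Swift 4 as SE-0172, so the unified spelling can be shown directly:

```swift
let s = "swift-evolution"
let i = s.firstIndex(of: "-")!
assert(s[..<i] == "swift")        // replaces s.prefix(upTo: i)
assert(s[i...] == "-evolution")   // replaces s.suffix(from: i)
assert(s[s.index(after: i)...] == "evolution")
```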

### Substrings

When implementing substring slicing, languages are faced with three
options:

1. Make the substrings the same type as string, and share storage.
2. Make the substrings the same type as string, and copy storage when
making the substring.
3. Make substrings a different type, with a storage copy on conversion to
string.

We think number 3 is the best choice. A walk-through of the tradeoffs
follows.

#### Same type, shared storage

In Swift 3.0, slicing a `String` produces a new `String` that is a view
into a
subrange of the original `String`'s storage. This is why `String` is 3
words in
size (the start, length and buffer owner), unlike the similar `Array` type
which is only one.

This is a simple model with big efficiency gains when chopping up strings
into
multiple smaller strings. But it does mean that a stored substring keeps
the
entire original string buffer alive even after it would normally have been
released.

This arrangement has proven to be problematic in other programming
languages,
because applications sometimes extract small strings from large ones and
keep
those small strings long-term. That is considered a memory leak and was
enough
of a problem in Java that they changed from substrings sharing storage to
making a copy in 1.7.

#### Same type, copied storage

Copying of substrings is also the choice made in C#, and in the default
`NSString` implementation. This approach avoids the memory leak issue, but
has
obvious performance overhead in performing the copies.

This in turn encourages trafficking in string/range pairs instead of in
substrings, for performance reasons, leading to API challenges. For
example:

```swift
foo.compare(bar, range: start..<end)
```

Here, it is not clear whether `range` applies to `foo` or `bar`. This
relationship is better expressed in Swift as a slicing operation:

```swift
foo[start..<end].compare(bar)
```

Not only does this clarify to which string the range applies, it also
brings
this sub-range capability to any API that operates on `String` "for free".
So
these other combinations also work equally well:

```swift
// apply range on argument rather than target
foo.compare(bar[start..<end])
// apply range on both
foo[start..<end].compare(bar[start1..<end1])
// compare two strings ignoring first character
foo.dropFirst().compare(bar.dropFirst())
```

In all three cases, an explicit range argument need not appear on the
`compare`
method itself. The implementation of `compare` does not need to know
anything
about ranges. Methods need only take range arguments when that was an
integral part of their purpose (for example, setting the start and end of a
user's current selection in a text box).

#### Different type, shared storage

The desire to share underlying storage while preventing accidental memory
leaks
occurs with slices of `Array`. For this reason we have an `ArraySlice`
type.
The inconvenience of a separate type is mitigated by most operations used
on
`Array` from the standard library being generic over `Sequence` or
`Collection`.

We should apply the same approach for `String` by introducing a distinct
`SubSequence` type, `Substring`. Similar advice given for `ArraySlice`
would apply to `Substring`:

> Important: Long-term storage of `Substring` instances is discouraged. A
> substring holds a reference to the entire storage of a larger string, not
> just to the portion it presents, even after the original string's
lifetime
> ends. Long-term storage of a `Substring` may therefore prolong the
lifetime
> of large strings that are no longer otherwise accessible, which can
appear
> to be memory leakage.

When assigning a `Substring` to a longer-lived variable (usually a stored
property) explicitly of type `String`, a type conversion will be
performed, and
at this point the substring buffer is copied and the original string's
storage
can be released.
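A sketch of the intended usage pattern, using a hypothetical `Model` type for illustration: slice freely during local computation, and convert to `String` at the point of long-term storage.

```swift
let bigString = String(repeating: "blah ", count: 10_000) + "needle"
let tail = bigString.suffix(6)          // Substring: shares bigString's storage
struct Model { var needle: String }     // stored property deliberately a String
let model = Model(needle: String(tail)) // the one copy happens here, after
                                        // which the big buffer can be released
assert(model.needle == "needle")
```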

A `String` that was not its own `Substring` could be one word—a single
tagged
pointer—without requiring additional allocations. `Substring`s would be a
view
onto a `String`, so are 3 words: a pointer to the owner, a pointer to the start, and a
length. The small string optimization for `Substring` would take advantage
of
the larger size, probably with a less compressed encoding for speed.

The downside of having two types is the inconvenience of sometimes having a
`Substring` when you need a `String`, and vice-versa. It is likely this
would
be a significantly bigger problem than with `Array` and `ArraySlice`, as
slicing of `String` is such a common operation. It is especially relevant
to
existing code that assumes `String` is the currency type. To ease the pain
of
type mismatches, `Substring` should be a subtype of `String` in the same
way
that `Int` is a subtype of `Optional<Int>`. This would give users an
implicit
conversion from `Substring` to `String`, as well as the usual implicit
conversions such as `[Substring]` to `[String]` that other subtype
relationships receive.

In most cases, type inference combined with the subtype relationship should
make the type difference a non-issue and users will not care which type
they
are using. For flexibility and optimizability, most operations from the
standard library will traffic in generic models of
[`Unicode`](#the--code-unicode--code--protocol).

##### Guidance for API Designers

In this model, **if a user is unsure about which type to use, `String` is
always
a reasonable default**. A `Substring` passed where `String` is expected
will be
implicitly copied. When compared to the “same type, copied storage” model,
we
have effectively deferred the cost of copying from the point where a
substring
is created until it must be converted to `String` for use with an API.

A user who needs to optimize away copies altogether should use this
guideline:
if for performance reasons you are tempted to add a `Range` argument to
your
method as well as a `String` to avoid unnecessary copies, you should
instead
use `Substring`.

##### The “Empty Subscript”

To make it easy to call such an optimized API when you only have a
`String` (or
to call any API that takes a `Collection`'s `SubSequence` when all you
have is
the `Collection`), we propose the following “empty subscript” operation,

```swift
extension Collection {
  subscript() -> SubSequence {
    return self[startIndex..<endIndex]
  }
}
```

which allows the following usage:

```swift
funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring
```

The `[]` syntax can be offered as a fixit when needed, similar to `&` for
an
`inout` argument. While it doesn't help a user to convert `[String]` to
`[Substring]`, the need for such conversions is extremely rare, and can be
done with a simple `map` (which could also be offered by a fixit):

```swift
takesAnArrayOfSubstring(arrayOfString.map { $0[] })
```

#### Other Options Considered

As we have seen, all three options above have downsides, but it's possible
these downsides could be eliminated/mitigated by the compiler. We are
proposing
one such mitigation—implicit conversion—as part of the "different type,
shared storage" option, to help avoid the cognitive load on developers of
having to deal with a separate `Substring` type.

To avoid the memory leak issues of a "same type, shared storage" substring
option, we considered whether the compiler could perform an implicit copy
of
the underlying storage when it detects the string is being "stored" for
long
term usage, say when it is assigned to a stored property. The trouble with
this
approach is it is very difficult for the compiler to distinguish between
long-term storage versus short-term in the case of abstractions that rely
on
stored properties. For example, should the storing of a substring inside an
`Optional` be considered long-term? Or the storing of multiple substrings
inside an array? The latter would not work well in the case of a
`components(separatedBy:)` implementation that intended to return an array
of
substrings. It would also be difficult to distinguish intentional
medium-term
storage of substrings, say by a lexer. There does not appear to be an
effective
consistent rule that could be applied in the general case for detecting
when a
substring is truly being stored long-term.

To avoid the cost of copying substrings under "same type, copied storage",
the
optimizer could be enhanced to reduce the impact of some of those
copies.
For example, this code could be optimized to pull the invariant substring
out
of the loop:

```swift
for _ in 0..<lots {
  someFunc(takingString: bigString[bigRange])
}
```

It's worth noting that a similar optimization is needed to avoid an
equivalent
problem with implicit conversion in the "different type, shared storage"
case:

```swift
let substring = bigString[bigRange]
for _ in 0..<lots { someFunc(takingString: substring) }
```

However, in the case of "same type, copied storage" there are many use
cases
that cannot be optimized as easily. Consider the following simple
definition of
a recursive `contains` algorithm, which when substring slicing is linear
makes
the overall algorithm quadratic:

```swift
extension String {
    func containsChar(_ x: Character) -> Bool {
        return !isEmpty && (first == x || dropFirst().containsChar(x))
    }
}
```

It is unrealistic to expect the optimizer to eliminate this problem, so the
user must remember to rewrite such code without string slicing to make it
efficient:

```swift
extension String {
    // add optional argument tracking progress through the string
    func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
        let idx = idx ?? startIndex
        return idx != endIndex
            && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
    }
}
```
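By contrast, under the "different type, shared storage" model the original recursive algorithm stays clean when written against `Substring`, because `dropFirst()` is then an O(1) re-slice of shared storage rather than a copy. A sketch using today's `Substring`:

```swift
extension Substring {
    func containsChar(_ x: Character) -> Bool {
        // dropFirst() returns another Substring viewing the same storage,
        // so the recursion performs no copying.
        return !isEmpty && (first == x || dropFirst().containsChar(x))
    }
}

assert("hello"[...].containsChar("l"))
assert(!"hello"[...].containsChar("z"))
```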

#### Substrings, Ranges and Objective-C Interop

The pattern of passing a string/range pair is common in several Objective-C
APIs, and is made especially awkward in Swift by the
non-interchangeability of
`Range<String.Index>` and `NSRange`.

```swift
s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))
```

In general, however, the Swift idiom for operating on a sub-range of a
`Collection` is to *slice* the collection and operate on that:

```swift
s2.find(s2[j..<s2.endIndex])
```

Therefore, APIs that operate on an `NSString`/`NSRange` pair should be
imported
without the `NSRange` argument. The Objective-C importer should be
changed to
give these APIs special treatment so that when a `Substring` is passed,
instead
of being converted to a `String`, the full `NSString` and range are passed
to
the Objective-C method, thereby avoiding a copy.

As a result, you would never need to pass an `NSRange` to these APIs, which
solves the impedance problem by eliminating the argument, resulting in more
idiomatic Swift code while retaining the performance benefit. To help
users
manually handle any cases that remain, Foundation should be augmented to
allow
the following syntax for converting to and from `NSRange`:

```swift
let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
let iToJ = Range(nsr, in: s)    // Equivalent to i..<j
```

### The `Unicode` protocol

With `Substring` and `String` being distinct types and sharing almost all
interface and semantics, and with the highest-performance string processing
requiring knowledge of encoding and layout that the currency types can't
provide, it becomes important to capture the common “string API” in a
protocol.
Since Unicode conformance is a key feature of string processing in Swift,
we
call that protocol `Unicode`:

**Note:** The following assumes several features that are planned but not
yet implemented in
  Swift, and should be considered a sketch rather than a final design.

```swift
protocol Unicode
  : Comparable, BidirectionalCollection where Element == Character {

  associatedtype Encoding : UnicodeEncoding
  var encoding: Encoding { get }

  associatedtype CodeUnits
    : RandomAccessCollection where Element == Encoding.CodeUnit
  var codeUnits: CodeUnits { get }

  associatedtype UnicodeScalars
    : BidirectionalCollection where Element == UnicodeScalar
  var unicodeScalars: UnicodeScalars { get }

  associatedtype ExtendedASCII
    : BidirectionalCollection where Element == UInt32
  var extendedASCII: ExtendedASCII { get }
}
```

```swift
extension Unicode {
  // ... define high-level non-mutating string operations, e.g. search ...

  func compared<Other: Unicode>(
    to rhs: Other,
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> SortOrder { ... }
}
```

```swift
extension Unicode : RangeReplaceableCollection
  where CodeUnits : RangeReplaceableCollection {
  // Satisfy protocol requirement
  mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
    where C.Element == Element

  // ... define high-level mutating string operations, e.g. replace ...
}
```

The goal is that `Unicode` exposes the underlying encoding and code units
in
such a way that for types with a known representation (e.g. a
high-performance
`UTF8String`) that information can be known at compile-time and can be
used to
generate a single path, while still allowing types like `String` that admit
multiple representations to use runtime queries and branches to fast path
specializations.

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.

### Scanning, Matching, and Tokenization

#### Low-Level Textual Analysis

We should provide convenient APIs for processing strings by character. For
example,
it should be easy to cleanly express, “if this string starts with `"f"`,
process
the rest of the string as follows…” Swift is well-suited to expressing
this
common pattern beautifully, but we need to add the APIs. Here are two
examples
of the sort of code that might be possible given such APIs:

```swift
if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
  somethingWith(input) // process the rest of input
}

if let (number, restOfInput) = input.parsingPrefix(Int.self) {
   ...
}
```

The specific spelling and functionality of APIs like this are TBD. The
larger
point is to make sure matching-and-consuming jobs are well-supported.
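As one possible shape, a hypothetical `droppingPrefix` (the name and signature are illustrative only, not a proposed API) can be built from today's stdlib:

```swift
extension Substring {
    // Hypothetical helper: consume the first character if it matches.
    func droppingPrefix(_ matches: (Character) -> Bool) -> (Character, Substring)? {
        guard let first = self.first, matches(first) else { return nil }
        return (first, dropFirst())
    }
}

let input = "f(x)"[...]
if let (firstLetter, rest) = input.droppingPrefix({ $0 == "f" }) {
    assert(firstLetter == "f")
    assert(rest == "(x)")   // process the rest of the input
}
```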

#### Unified Pattern Matcher Protocol

Many of the current methods that do matching are overloaded to do the same
logical operations in different ways, with the following axes:

- Logical Operation: `find`, `split`, `replace`, match at start
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
  the method name, and sometimes an argument
- Whole string or subrange.

We should represent these aspects as orthogonal, composable components,
abstracting pattern matchers into a protocol like
[this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
that can allow us to define logical operations once, without introducing
overloads, and massively reducing API surface area.

For example, using the strawman prefix `%` syntax to turn string literals
into
patterns, the following pairs would all invoke the same generic methods:

```swift
if let found = s.firstMatch(%"searchString") { ... }
if let found = s.firstMatch(someRegex) { ... }

for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
for m in s.allMatches(someRegex) { ... }

let items = s.split(separatedBy: ", ")
let tokens = s.split(separatedBy: CharacterSet.whitespace)
```

Note that, because Swift requires the indices of a slice to match the
indices of
the range from which it was sliced, operations like `firstMatch` can
return a
`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match
in
the string being searched, if needed, can easily be recovered as the
`startIndex` and `endIndex` of the `Substring`.
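This index-sharing guarantee is already observable with ordinary slicing:

```swift
let s = "hello world"
let i = s.firstIndex(of: "w")!
let match = s[i...]                  // a Substring standing in for a Range
assert(match.startIndex == i)        // the range is recoverable from the slice
assert(s[match.startIndex..<match.endIndex] == "world")
```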

Note also that matching operations are useful for collections in general,
and
would fall out of this proposal:

```swift
// replace subsequences of contiguous NaNs with zero
forces.replace(oneOrMore([Float.nan]), [0.0])
```

#### Regular Expressions

Addressing regular expressions is out of scope for this proposal.
That said, it is important to note that the pattern matching protocol
mentioned
above provides a suitable foundation for regular expressions, and types
such as
`NSRegularExpression` can easily be retrofitted to conform to it. In the
future, support for regular expression literals in the compiler could
allow for
compile-time syntax checking and optimization.

### String Indices

`String` currently has four views—`characters`, `unicodeScalars`, `utf8`,
and
`utf16`—each with its own opaque index type. The APIs used to translate
indices
between views add needless complexity, and the opacity of indices makes
them
difficult to serialize.

The index translation problem has two aspects:

  1. `String` views cannot consume one another's indices without a
cumbersome
    conversion step. An index into a `String`'s `characters` must be
translated
    before it can be used as a position in its `unicodeScalars`. Although
these
    translations are rarely needed, they add conceptual and API complexity.
  2. Many APIs in the core libraries and other frameworks still expose
`String`
    positions as `Int`s and regions as `NSRange`s, which can only
reference a
    `utf16` view and interoperate poorly with `String` itself.

#### Index Interchange Among Views

String's need for flexible backing storage and reasonably-efficient
indexing
(i.e. without dynamically allocating and reference-counting the indices
themselves) means indices need an efficient underlying storage type.
Although
we do not wish to expose `String`'s indices *as* integers, `Int` offsets
into
underlying code unit storage make a good underlying storage type, provided
`String`'s underlying storage supports random-access. We think
random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Making these `Int` code unit offsets conveniently accessible and
constructible
solves the serialization problem:

```swift
clipboard.write(s.endIndex.codeUnitOffset)
let offset = clipboard.read(Int.self)
let i = String.Index(codeUnitOffset: offset)
```

Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
seamless by having them share an index type (semantics of indexing a
`String`
between grapheme cluster boundaries are TBD—it can either trap or be
forgiving).
Having a common index allows easy traversal into the interior of graphemes,
something that is often needed, without making it likely that someone will
do it
by accident.

- `String.index(after:)` should advance to the next grapheme, even when
the
   index points partway through a grapheme.

- `String.index(before:)` should move to the start of the grapheme before
   the current position.

Seamless index interchange between `String` and its UTF-8 or UTF-16 views
is not
crucial, as the specifics of encoding should not be a concern for most use
cases, and would impose needless costs on the indices of other views. That
said, we can make translation much more straightforward by exposing simple
bidirectional converting `init`s on both index types:

```swift
let u8Position = String.UTF8.Index(someStringIndex)
let originalPosition = String.Index(u8Position)
```

#### Index Interchange with Cocoa

We intend to address `NSRange`s that denote substrings in Cocoa APIs as
described [earlier in this document](#substrings--ranges-and-objective-c-interop).
That leaves the interchange of bare indices with Cocoa APIs trafficking in
`Int`. Hopefully such APIs will be rare, but when needed, the following
extension, which would be useful for all `Collections`, can help:

```swift
extension Collection {
  func index(offset: IndexDistance) -> Index {
    return index(startIndex, offsetBy: offset)
  }
  func offset(of i: Index) -> IndexDistance {
    return distance(from: startIndex, to: i)
  }
}
```

Then integers can easily be translated into offsets into a `String`'s
`utf16`
view for consumption by Cocoa:

```swift
let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
let swiftIndex = s.utf16.index(offset: cocoaIndex)
```

### Formatting

A full treatment of formatting is out of scope of this proposal, but
we believe it's crucial for completing the text processing picture. This
section details some of the existing issues and thinking that may guide
future
development.

#### Printf-Style Formatting

`String.format` is designed on the `printf` model: it takes a format
string with
textual placeholders for substitution, and an arbitrary list of other
arguments.
The syntax and meaning of these placeholders has a long history in
C, but for anyone who doesn't use them regularly they are cryptic and
complex,
as the `printf(3)` man page attests.

Aside from complexity, this style of API has two major problems: First, the
spelling of these placeholders must match up to the types of the arguments, in
the right order, or the behavior is undefined. Some limited support for
compile-time checking of this correspondence could be implemented, but only for
the cases where the format string is a literal. Second, there's no reasonable
way to extend the formatting vocabulary to cover the needs of new types: you
are stuck with what's in the box.
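As an illustration of the first problem (a sketch using Foundation's `String(format:)`, which follows the `printf` model):

```swift
import Foundation

// The specifier "%08x" must match the argument's type; the compiler
// cannot verify this correspondence for us.
let hex = String(format: "%08x", UInt32(48879))  // "0000beef"
// String(format: "%08x", "oops")  // compiles, but behavior is undefined
```

Because the check can only happen at runtime, a mismatched specifier produces garbage or a crash rather than a compile-time diagnostic.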

#### Foundation Formatters

The formatters supplied by Foundation are highly capable and versatile, offering
both formatting and parsing services. When used for formatting, though, the
design pattern demands more from users than it should:

  * Matching the type of data being formatted to a formatter type
  * Creating an instance of that type
  * Setting stateful options (`currency`, `dateStyle`) on the type. Note: the
    need for this step prevents the instance from being used and discarded in
    the same expression where it is created.
  * Overall, introduction of needless verbosity into source

These may seem like small issues, but the experience of Apple localization
experts is that the total drag of these factors on programmers is such that
they tend to reach for `String.format` instead.
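For example, formatting a single number as currency with `NumberFormatter` walks through exactly the steps enumerated above (the locale is pinned here only to make the output deterministic):

```swift
import Foundation

let formatter = NumberFormatter()         // match data to a formatter type; create an instance
formatter.numberStyle = .currency         // set stateful options
formatter.locale = Locale(identifier: "en_US")
let price = formatter.string(from: 9.99)  // "$9.99"; cannot be done in one expression
```

Three statements of setup for one formatted value is the verbosity the section describes.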

#### String Interpolation

Swift string interpolation provides a user-friendly alternative to printf's
domain-specific language (just write ordinary swift code!) and its type safety
problems (put the data right where it belongs!) but the following issues
prevent it from being useful for localized formatting (among other jobs):

  * [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
    types used in string interpolation.
  * [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
    distinguish (fragments of) the base string from the string substitutions.

In the long run, we should improve Swift string interpolation to the point
where it can participate in most any formatting job. Mostly this centers around
fixing the interpolation protocols per the previous item, and supporting
localization.

To be able to use formatting effectively inside interpolations, it needs to be
both lightweight (because it all happens in-situ) and discoverable. One
approach would be to standardize on `format` methods, e.g.:

"Column 1: \(n.format(radix:16, width:8)) *** \(message)"

"Something with leading zeroes: \(x.format(fill: zero, width:8))"

### C String Interop

Our support for interoperation with nul-terminated C strings is scattered and
incoherent, with six ways to transform a C string into a `String` and four
ways to do the inverse. These APIs should be replaced with the following:

extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  ///   bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

  /// Constructs a `String` having the same contents as
  /// `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units
  ///   in the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}
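The first initializer and `withCString` already exist in roughly this form today; a round-trip sketch of the intended usage:

```swift
// Hand the string to C-style code as NUL-terminated UTF-8, then decode
// the same bytes back into a fresh String.
let greeting = "Hello, 🌍"
let roundTripped = greeting.withCString { cString in
    String(cString: cString)
}
assert(roundTripped == greeting)
```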

In both of the construction APIs, any invalid encoding sequence detected will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per the Unicode specification. This covers the common case. The
replacement is done *physically* in the underlying storage and the validity of
the result is recorded in the `String`'s `encoding` such that future accesses
need not be slowed down by separate error repair.

Construction that is aborted when encoding errors are detected can be
accomplished using APIs on the `encoding`. String types that retain their
physical encoding even in the presence of errors and are repaired on-the-fly
can be built as different instances of the `Unicode` protocol.

### Unicode 9 Conformance

Unicode 9 (and MacOS 10.11) brought us support for family emoji, which changes
the process of properly identifying `Character` boundaries. We need to update
`String` to account for this change.

### High-Performance String Processing

Many strings are short enough to store in 64 bits, many can be stored using
only 8 bits per unicode scalar, others are best encoded in UTF-16, and some
come to us already in some other encoding, such as UTF-8, that would be costly
to translate. Supporting these formats while maintaining usability for
general-purpose APIs demands that a single `String` type can be backed by many
different representations.

That said, the highest performance code always requires static knowledge of the
data structures on which it operates, and for this code, dynamic selection of
representation comes at too high a cost. Heavy-duty text processing demands a
way to opt out of dynamism and directly use known encodings. Having this
ability can also make it easy to cleanly specialize code that handles dynamic
cases for maximal efficiency on the most common representations.

To address this need, we can build models of the `Unicode` protocol that encode
representation information into the type, such as `NFCNormalizedUTF16String`.

### Parsing ASCII Structure

Although many machine-readable formats support the inclusion of arbitrary
Unicode text, it is also common that their fundamental structure lies entirely
within the ASCII subset (JSON, YAML, many XML formats). These formats are often
processed most efficiently by recognizing ASCII structural elements as ASCII,
and capturing the arbitrary sections between them in more-general strings. The
current String API offers no way to efficiently recognize ASCII and skip past
everything else without the overhead of full decoding into unicode scalars.

For these purposes, strings should supply an `extendedASCII` view that is a
collection of `UInt32`, where values less than `0x80` represent the
corresponding ASCII character, and other values represent data that is specific
to the underlying encoding of the string.
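The `extendedASCII` view itself is hypothetical, but today's `utf8` view approximates the idea: structural bytes can be matched as ASCII without decoding the non-ASCII payload into scalars. A sketch:

```swift
// Detect the ':' separator in a JSON-ish fragment by scanning code units,
// never decoding the non-ASCII text between the quotes.
let fragment = "\"smörgåsbord\": true"
let hasColon = fragment.utf8.contains(UInt8(ascii: ":"))
```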

## Language Support

This proposal depends on two new features in the Swift language:

1. **Generic subscripts**, to
   enable unified slicing syntax.

2. **A subtype relationship** between
   `Substring` and `String`, enabling framework APIs to traffic solely in
   `String` while still making it possible to avoid copies by handling
   `Substring`s where necessary.

Additionally, **the ability to nest types and protocols inside
protocols** could significantly shrink the footprint of this proposal
on the top-level Swift namespace.

## Open Questions

### Must `String` be limited to storing UTF-16 subset encodings?

- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
  question here; this is about what encodings must be storable, without
  transcoding, in the common currency type called “`String`”.
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
- If we have a way to get at a `String`'s code units, we need a concrete type
  in which to express them in the API of `String`, which is a concrete type.
- If `String` needs to be able to represent UTF-32, presumably the code units
  need to be `UInt32`.
- Not supporting UTF-32-encoded text seems like one reasonable design choice.
- Maybe we can allow UTF-8 storage in `String` and expose its code units as
  `UInt16`, just as we would for Latin-1.
- Supporting only UTF-16-subset encodings would imply that `String` indices
  can be serialized without recording the `String`'s underlying encoding.

### Do we need a type-erasable base protocol for UnicodeEncoding?

`UnicodeEncoding` has an associated type, but it may be important to be able to
traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”

### Should there be a string “facade?”

One possible design alternative makes `Unicode` a vehicle for expressing
the storage and encoding of code units, but does not attempt to give it an API
appropriate for `String`. Instead, string APIs would be provided by a generic
wrapper around an instance of `Unicode`:

struct StringFacade<U: Unicode> : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias String = StringFacade<StringStorage>
typealias Substring = StringFacade<StringStorage.SubSequence>

This design would allow us to de-emphasize lower-level `String` APIs such as
access to the specific encoding, by putting them behind a `.unicode` property.
A similar effect in a facade-less design would require a new top-level
`StringProtocol` playing the role of the facade with an `associatedtype
Storage : Unicode`.
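A minimal compiling sketch of the facade shape (names and constraints are assumed here; the real `Unicode` protocol does not exist yet, so `BidirectionalCollection` stands in for it):

```swift
// String-level APIs live on the wrapper; lower-level storage and
// encoding details are reachable only through `.unicode`.
struct Facade<U: BidirectionalCollection> {
    var unicode: U  // access to lower-level details
    var isEmpty: Bool { return unicode.isEmpty }
}

let f = Facade(unicode: "hello".unicodeScalars)
```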

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

struct String<U: Unicode = StringStorage>
  : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>

One advantage of such a design is that naïve users will always extend “the
right type” (`String`) without thinking, and the new APIs will show up on
`Substring`, `MyUTF8String`, etc. That said, it also has downsides that should
not be overlooked, not least of which is the confusability of the meaning of
the word “string.” Is it referring to the generic or the concrete type?

### `TextOutputStream` and `TextOutputStreamable`

`TextOutputStreamable` is intended to provide a vehicle for
efficiently transporting formatted representations to an output stream
without forcing the allocation of storage. Its use of `String`, a
type with multiple representations, as the lowest-level unit of
communication conflicts with this goal. It might be sufficient to
change `TextOutputStream` and `TextOutputStreamable` to traffic in an
associated type conforming to `Unicode`, but that is not yet clear.
This area will require some design work.

### `description` and `debugDescription`

* Should these be creating localized or non-localized representations?
* Is returning a `String` efficient enough?
* Is `debugDescription` pulling the weight of the API surface area it adds?

### `StaticString`

`StaticString` was added as a byproduct of standard library development and
kept around because it seemed useful, but it was never truly *designed* for
client programmers. We need to decide what happens with it. Presumably
*something* should fill its role, and that should conform to `Unicode`.

## Footnotes

<b id="f0">0</b> The integers rewrite currently underway is expected to
    substantially reduce the scope of `Int`'s API by using more
    generics. [:leftwards_arrow_with_hook:](#a0)

<b id="f1">1</b> In practice, these semantics will usually be tied to the
version of the installed [ICU](http://icu-project.org) library, which
programmatically encodes the most complex rules of the Unicode Standard and
its de-facto extension, CLDR. [:leftwards_arrow_with_hook:](#a1)

<b id="f2">2</b> See
[http://unicode.org/reports/tr29/#Notation](http://unicode.org/reports/tr29/#Notation).
Note that inserting Unicode scalar values to prevent merging of grapheme
clusters would also constitute a kind of misbehavior (one of the clusters at
the boundary would not be found in the result), so would be relatively costly
to implement, with little benefit. [:leftwards_arrow_with_hook:](#a2)

<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by
  the Unicode standard for this purpose. In fact there's a
  [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
  dedicated to it. In particular, §5.17 says:

  > When comparing text that is visible to end users, a correct linguistic
  > sort should be used, as described in _Section 5.16, Sorting and
  > Searching_. However, in many circumstances the only requirement is for a
  > fast, well-defined ordering. In such cases, a binary ordering can be used.

  [:leftwards_arrow_with_hook:](#a4)

<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto
properties in a table that's indexed by unicode scalar value. This table is
part of the Unicode standard. Some of these queries (e.g., “is this an
uppercase character?”) may have fairly obvious generalizations to grapheme
clusters, but exactly how to do it is a research topic and *ideally* we'd
either establish the existing practice that the Unicode committee would
standardize, or the Unicode committee would do the research and we'd implement
their result. [:leftwards_arrow_with_hook:](#a5)

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Josh Parmenter) #6

I’ve just done a first read through of this, and in general I feel like this is very well thought out, and lessons from Swift 2 and 3 are well incorporated. I’ll try to give other feedback later when I have more time, but on an initial glance, thanks for this!
Best,
Josh

···

On Jan 19, 2017, at 6:56 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org<mailto:swift-evolution@swift.org>> wrote:

Hi all,

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

We’re eager to hear everyone’s thoughts.

Regards,
Ben and Dave

# String Processing For Swift 4

* Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](https://github.com/airspeedswift)

The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus
far, with just this short blurb in the
[list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):

**String re-evaluation**: String is one of the most important fundamental
types in the language. The standard library leads have numerous ideas of how
to improve the programming model for it, without jeopardizing the goals of
providing a unicode-correct-by-default model. Our goal is to be better at
string processing than Perl!

For Swift 4 and beyond we want to improve three dimensions of text processing:

1. Ergonomics
2. Correctness
3. Performance

This document is meant to both provide a sense of the long-term vision
(including undecided issues and possible approaches), and to define the scope of
work that could be done in the Swift 4 timeframe.

## General Principles

### Ergonomics

It's worth noting that ergonomics and correctness are mutually-reinforcing. An
API that is easy to use—but incorrectly—cannot be considered an ergonomic
success. Conversely, an API that's simply hard to use is also hard to use
correctly. Achieving optimal performance without compromising ergonomics or
correctness is a greater challenge.

Consistency with the Swift language and idioms is also important for
ergonomics. There are several places both in the standard library and in the
foundation additions to `String` where patterns and practices found elsewhere
could be applied to improve usability and familiarity.

### API Surface Area

Primary data types such as `String` should have APIs that are easily understood
given a signature and a one-line summary. Today, `String` fails that test. As
you can see, the Standard Library and Foundation both contribute significantly to
its overall complexity.

**Method Arity** | **Standard Library** | **Foundation**
---|:---:|:---:
0: `ƒ()` | 5 | 7
1: `ƒ(:)` | 19 | 48
2: `ƒ(::)` | 13 | 19
3: `ƒ(:::)` | 5 | 11
4: `ƒ(::::)` | 1 | 7
5: `ƒ(:::::)` | - | 2
6: `ƒ(::::::)` | - | 1

**API Kind** | **Standard Library** | **Foundation**
---|:---:|:---:
`init` | 41 | 18
`func` | 42 | 55
`subscript` | 9 | 0
`var` | 26 | 14

**Total: 205 APIs**

By contrast, `Int` has 80 APIs, none with more than two parameters.[0] String processing is complex enough; users shouldn't have
to press through physical API sprawl just to get started.

Many of the choices detailed below contribute to solving this problem,
including:

* Restoring `Collection` conformance and dropping the `.characters` view.
* Providing a more general, composable slicing syntax.
* Altering `Comparable` so that parameterized
   (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
* Clearly separating language-dependent operations on text produced
   by and for humans from language-independent
   operations on text produced by and for machine processing.
* Relocating APIs that fall outside the domain of basic string processing and
   discouraging the proliferation of ad-hoc extensions.

### Batteries Included

While `String` is available to all programs out-of-the-box, crucial APIs for
basic string processing tasks are still inaccessible until `Foundation` is
imported. While it makes sense that `Foundation` is needed for domain-specific
jobs such as
[linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
one should not need to import anything to, for example, do case-insensitive
comparison.

### Unicode Compliance and Platform Support

The Unicode standard provides a crucial objective reference point for what
constitutes correct behavior in an extremely complex domain, so
Unicode-correctness is, and will remain, a fundamental design principle behind
Swift's `String`. That said, the Unicode standard is an evolving document, so
this objective reference-point is not fixed.[1] While
many of the most important operations—e.g. string hashing, equality, and
non-localized comparison—will be stable, the semantics
of others, such as grapheme breaking and localized comparison and case
conversion, are expected to change as platforms are updated, so programs should
be written so their correctness does not depend on precise stability of these
semantics across OS versions or platforms. Although it may be possible to
imagine static and/or dynamic analysis tools that will help users find such
errors, the only sure way to deal with this fact of life is to educate users.

## Design Points

### Internationalization

There is strong evidence that developers cannot determine how to use
internationalization APIs correctly. Although documentation could and should be
improved, the sheer size, complexity, and diversity of these APIs is a major
contributor to the problem, causing novices to tune out, and more experienced
programmers to make avoidable mistakes.

The first step in improving this situation is to regularize all localized
operations as invocations of normal string operations with extra
parameters. Among other things, this means:

1. Doing away with `localizedXXX` methods
2. Providing a terse way to name the current locale as a parameter
3. Automatically adjusting defaults for options such
  as case sensitivity based on whether the operation is localized.
4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
   guidance in the
   [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).

Along with appropriate documentation updates, these changes will make localized
operations more teachable, comprehensible, and approachable, thereby lowering a
barrier that currently leads some developers to ignore localization issues
altogether.

#### The Default Behavior of `String`

Although this isn't well-known, the most accessible form of many operations on
Swift `String` (and `NSString`) are really only appropriate for text that is
intended to be processed for, and consumed by, machines. The semantics of the
operations with the simplest spellings are always non-localized and
language-agnostic.

Two major factors play into this design choice:

1. Machine processing of text is important, so we should have first-class,
  accessible functions appropriate to that use case.

2. The most general localized operations require a locale parameter not required
  by their un-localized counterparts. This naturally skews complexity towards
  localized operations.

Reaffirming that `String`'s simplest APIs have
language-independent/machine-processed semantics has the benefit of clarifying
the proper default behavior of operations such as comparison, and allows us to
make [significant optimizations](#collation-semantics) that were previously
thought to conflict with Unicode.

#### Future Directions

One of the most common internationalization errors is the unintentional
presentation to users of text that has not been localized, but regularizing APIs
and improving documentation can go only so far in preventing this error.
Combined with the fact that `String` operations are non-localized by default,
the environment for processing human-readable text may still be somewhat
error-prone in Swift 4.

For an audience of mostly non-experts, it is especially important that naïve
code is very likely to be correct if it compiles, and that more sophisticated
issues can be revealed progressively. For this reason, we intend to
specifically and separately target localization and internationalization
problems in the Swift 5 timeframe.

### Operations With Options

There are three categories of common string operation that commonly need to be
tuned in various dimensions:

**Operation**|**Applicable Options**
---|---
sort ordering | locale, case/diacritic/width-insensitivity
case conversion | locale
pattern matching | locale, case/diacritic/width-insensitivity

The defaults for case-, diacritic-, and width-insensitivity are different for
localized operations than for non-localized operations, so for example a
localized sort should be case-insensitive by default, and a non-localized sort
should be case-sensitive by default. We propose a standard “language” of
defaulted parameters to be used for these purposes, with usage roughly like this:

 x.compared(to: y, case: .sensitive, in: swissGerman)

 x.lowercased(in: .currentLocale)

 x.allMatches(
   somePattern, case: .insensitive, diacritic: .insensitive)

This usage might be supported by code like this:

enum StringSensitivity {
  case sensitive
  case insensitive
}

extension Locale {
  static var currentLocale: Locale { ... }
}

extension Unicode {
  // An example of the option language in declaration context,
  // with nil defaults indicating unspecified, so defaults can be
  // driven by the presence/absence of a specific Locale
  func frobnicated(
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> Self { ... }
}

### Comparing and Hashing Strings

#### Collation Semantics

What Unicode says about collation—which is used in `<`, `==`, and hashing—turns
out to be quite interesting, once you pick it apart. The full Unicode Collation
Algorithm (UCA) works like this:

1. Fully normalize both strings
2. Convert each string to a sequence of numeric triples to form a collation key
3. “Flatten” the key by concatenating the sequence of first elements to the
  sequence of second elements to the sequence of third elements
4. Lexicographically compare the flattened keys

While step 1 can usually
be [done quickly](http://unicode.org/reports/tr15/#Description_Norm) and
incrementally, step 2 uses a collation table that maps matching *sequences* of
unicode scalars in the normalized string to *sequences* of triples, which get
accumulated into a collation key. Predictably, this is where the real costs
lie.
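Steps 2 through 4 can be sketched with toy weights (the triples below are illustrative, not real collation-table values):

```swift
// Flatten a collation key: all primary weights, then all secondary,
// then all tertiary; compare the flattened keys lexicographically.
func flatten(_ key: [(UInt16, UInt16, UInt16)]) -> [UInt16] {
    return key.map { $0.0 } + key.map { $0.1 } + key.map { $0.2 }
}

let a = flatten([(10, 1, 1), (11, 1, 1)])  // hypothetical key for "ab"
let b = flatten([(10, 2, 1), (11, 1, 1)])  // differs only at the secondary level
assert(a.lexicographicallyPrecedes(b))     // primaries tie; 1 < 2 decides
```

Because secondary weights are compared only after every primary weight, an accent difference can never outrank a base-letter difference.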

*However*, there are some bright spots to this story. First, as it turns out,
string sorting (localized or not) should be done down to what's called
the
[“identical” level](http://unicode.org/reports/tr10/#Multi_Level_Comparison),
which adds a step 3a: append the string's normalized form to the flattened
collation key. At first blush this just adds work, but consider what it does
for equality: two strings that normalize the same, naturally, will collate the
same. But also, *strings that normalize differently will always collate
differently*. In other words, for equality, it is sufficient to compare the
strings' normalized forms and see if they are the same. We can therefore
entirely skip the expensive part of collation for equality comparison.
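This equivalence can be demonstrated today with Foundation's canonical-composition mapping standing in for the normalization step:

```swift
import Foundation

// Two encodings of "é": precomposed U+00E9 vs. "e" + combining acute U+0301.
let composed = "\u{E9}"
let decomposed = "e\u{301}"

// Comparing normalized forms suffices for equality; no collation key needed.
assert(composed.precomposedStringWithCanonicalMapping ==
       decomposed.precomposedStringWithCanonicalMapping)
```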

Next, naturally, anything that applies to equality also applies to hashing: it
is sufficient to hash the string's normalized form, bypassing collation keys.
This should provide significant speedups over the current implementation.
Perhaps more importantly, since comparison down to the “identical” level applies
even to localized strings, it means that hashing and equality can be implemented
exactly the same way for localized and non-localized text, and hash tables with
localized keys will remain valid across current-locale changes.

Finally, once it is agreed that the *default* role for `String` is to handle
machine-generated and machine-readable text, the default ordering of `String`s
need no longer use the UCA at all. It is sufficient to order them in any way
that's consistent with equality, so `String` ordering can simply be a
lexicographical comparison of normalized forms,[4]
(which is equivalent to lexicographically comparing the sequences of grapheme
clusters), again bypassing step 2 and offering another speedup.

This leaves us executing the full UCA *only* for localized sorting, and ICU's
implementation has apparently been very well optimized.

Following this scheme everywhere would also allow us to make sorting behavior
consistent across platforms. Currently, we sort `String` according to the UCA,
except that—*only on Apple platforms*—pairs of ASCII characters are ordered by
unicode scalar value.

#### Syntax

Because the current `Comparable` protocol expresses all comparisons with binary
operators, string comparisons—which may require
additional [options](#operations-with-options)—do not fit smoothly into the
existing syntax. At the same time, we'd like to solve other problems with
comparison, as outlined
in
[this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
(implemented by changes at the head
of
[this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
We should adopt a modification of that proposal that uses a method rather than
an operator `<=>`:

enum SortOrder { case before, same, after }

protocol Comparable : Equatable {
  func compared(to: Self) -> SortOrder
  ...
}

This change will give us a syntactic platform on which to implement methods with
additional, defaulted arguments, thereby unifying and regularizing comparison
across the library.

extension String {
  func compared(to: Self) -> SortOrder
}

**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible
that the standard library simply adopts Foundation's `ComparisonResult` as is,
but we believe the community should at least consider alternate naming before
that happens. There will be an opportunity to discuss the choices in detail
when the modified
[Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes
up for review.

### `String` should be a `Collection` of `Character`s Again

In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
convinced ourselves that its semantics differed from those of `Collection` too
significantly.

It was always well understood that if strings were treated as sequences of
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`,
and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was
a collection of `Character` (extended grapheme clusters). During 2.0
development, though, we realized that correct string concatenation could
occasionally merge distinct grapheme clusters at the start and end of combined
strings.

This quirk aside, every aspect of strings-as-collections-of-graphemes appears to
comport perfectly with Unicode. We think the concatenation problem is tolerable,
because the cases where it occurs all represent partially-formed constructs. The
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE
ACCENT)—are explicitly called out in the Unicode standard as
“[degenerate](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)” or
“[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The other
cases—such as a string ending in a zero-width joiner or half of a regional
indicator—appear to be equally transient and unlikely outside of a text editor.

Admitting these cases encourages exploration of grapheme composition and is
consistent with what appears to be an overall Unicode philosophy that “no
special provisions are made to get marginally better behavior for… cases that
never occur in practice.”[2] Furthermore, it seems
unlikely to disturb the semantics of any plausible algorithms. We can handle
these cases by documenting them, explicitly stating that the elements of a
`String` are an emergent property based on Unicode rules.

The benefits of restoring `Collection` conformance are substantial:

* Collection-like operations encourage experimentation with strings to
   investigate and understand their behavior. This is useful for teaching new
   programmers, but also good for experienced programmers who want to
   understand more about strings/unicode.

* Extended grapheme clusters form a natural element boundary for Unicode
   strings. For example, searching and matching operations will always produce
   results that line up on grapheme cluster boundaries.

* Character-by-character processing is a legitimate thing to do in many real
   use-cases, including parsing, pattern matching, and language-specific
   transformations such as transliteration.

* `Collection` conformance makes a wide variety of powerful operations
   available that are appropriate to `String`'s default role as the vehicle for
   machine processed text.

   The methods `String` would inherit from `Collection`, where they are
   similar to higher-level string algorithms, have the right semantics. For example,
   grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of
   `flatMap` with case-conversion, produce the same results one would expect
   from whole-string ordering comparison, equality comparison, and
   case-conversion, respectively. `reverse` operates correctly on graphemes,
   keeping diacritics moored to their base characters and leaving emoji intact.
   Other methods such as `indexOf` and `contains` make obvious sense. A few
   `Collection` methods, like `min` and `max`, may not be particularly useful
   on `String`, but we don't consider that to be a problem worth solving, in
   the same way that we wouldn't try to suppress `min` and `max` on a
   `Set([UInt8])` that was used to store IP addresses.

* Many of the higher-level operations that we want to provide for `String`s,
   such as parsing and pattern matching, should apply to any `Collection`, and
   many of the benefits we want for `Collections`, such
   as unified slicing, should accrue
   equally to `String`. Making `String` part of the same protocol hierarchy
   allows us to write these operations once and not worry about keeping the
   benefits in sync.

* Slicing strings into substrings is a crucial part of the vocabulary of
   string processing, and all other sliceable things are `Collection`s.
   Because of its collection-like behavior, users naturally think of `String`
   in collection terms, but run into frustrating limitations where it fails to
   conform and are left to wonder where all the differences lie. Many simply
   “correct” this limitation by declaring a trivial conformance:

 extension String : BidirectionalCollection {}

   Even if we removed indexing-by-element from `String`, users could still do
   this:

     extension String : BidirectionalCollection {
       subscript(i: Index) -> Character { return characters[i] }
     }

   It would be much better to legitimize the conformance to `Collection` and
   simply document the oddity of any concatenation corner-cases, than to deny
   users the benefits on the grounds that a few cases are confusing.

Note that the fact that `String` is a collection of graphemes does *not* mean
that string operations will necessarily have to do grapheme boundary
recognition. See the Unicode protocol section for details.

### `Character` and `CharacterSet`

`Character`, which represents a
Unicode
[extended grapheme cluster](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries),
is a bit of a black box, requiring conversion to `String` in order to
do any introspection, including interoperation with ASCII. To fix this, we should:

- Add a `unicodeScalars` view much like `String`'s, so that the sub-structure
  of grapheme clusters is discoverable.
- Add a failable `init` from sequences of scalars (returning nil for sequences
  that contain 0 or 2+ graphemes).
- (Lower priority) expose some operations, such as `func uppercase() ->
  String`, `var isASCII: Bool`, and, to the extent they can be sensibly
  generalized, queries of unicode properties that should also be exposed on
  `UnicodeScalar` such as `isAlphabetic` and `isGraphemeBase` .
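A sketch of what these additions enable; the `unicodeScalars` view on `Character` later shipped in Swift, while the richer property queries remain hypothetical names from the list above:

```swift
// A family emoji is one Character built from five scalars
// (three people joined by two zero-width joiners):
let family: Character = "👨‍👩‍👧"
let scalarCount = family.unicodeScalars.count   // 5

// ASCII interoperation becomes a simple scalar query, no String detour:
let a: Character = "a"
let isASCII = a.unicodeScalars.count == 1 && a.unicodeScalars.first!.isASCII
```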

Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar`
type. This means it is usable on `String`, but only by going through the unicode
scalar view. To deal with this clash in the short term, `CharacterSet` should be
renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to
introduce a `CharacterSet` that provides similar functionality for extended
grapheme clusters.[5]

### Unification of Slicing Operations

Creating substrings is a basic part of String processing, but the slicing
operations that we have in Swift are inconsistent in both their spelling and
their naming:

* Slices with two explicit endpoints are done with subscript, and support
   in-place mutation:

       s[i..<j].mutate()

* Slicing from an index to the end, or from the start to an index, is done
   with a method and does not support in-place mutation:

       s.prefix(upTo: i).readOnly()

Prefix and suffix operations should be migrated to subscripting operations
with one-sided ranges, i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as
in
[this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
With generic subscripting in the language, that will allow us to collapse a wide
variety of methods and subscript overloads into a single implementation, and
give users an easy-to-use and composable way to describe subranges.
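A sketch of the unified spelling, using the one-sided ranges that landed in Swift 4 via the linked proposal:

```swift
let s = "Hello, world"
let comma = s.firstIndex(of: ",")!

// s.prefix(upTo: comma) becomes a subscript with a one-sided range:
let head = s[..<comma]                    // "Hello"
// ...and the suffix operation likewise:
let tail = s[s.index(after: comma)...]    // " world"
```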

Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 5)`
is an ongoing research project that can be considered part of the potential
long-term vision of text (and collection) processing.

### Substrings

When implementing substring slicing, languages are faced with three options:

1. Make the substrings the same type as string, and share storage.
2. Make the substrings the same type as string, and copy storage when making the substring.
3. Make substrings a different type, with a storage copy on conversion to string.

We think number 3 is the best choice. A walk-through of the tradeoffs follows.

#### Same type, shared storage

In Swift 3.0, slicing a `String` produces a new `String` that is a view into a
subrange of the original `String`'s storage. This is why `String` is 3 words in
size (the start, length and buffer owner), unlike the similar `Array` type
which is only one.

This is a simple model with big efficiency gains when chopping up strings into
multiple smaller strings. But it does mean that a stored substring keeps the
entire original string buffer alive even after it would normally have been
released.

This arrangement has proven to be problematic in other programming languages,
because applications sometimes extract small strings from large ones and keep
those small strings long-term. That is considered a memory leak and was enough
of a problem in Java that they changed from substrings sharing storage to
making a copy in 1.7.

#### Same type, copied storage

Copying of substrings is also the choice made in C#, and in the default
`NSString` implementation. This approach avoids the memory leak issue, but has
obvious performance overhead in performing the copies.

This in turn encourages trafficking in string/range pairs instead of in
substrings, for performance reasons, leading to API challenges. For example:

```swift
foo.compare(bar, range: start..<end)
```

Here, it is not clear whether `range` applies to `foo` or `bar`. This
relationship is better expressed in Swift as a slicing operation:

```swift
foo[start..<end].compare(bar)
```

Not only does this clarify to which string the range applies, it also brings
this sub-range capability to any API that operates on `String` "for free". So
these other combinations also work equally well:

```swift
// apply range on argument rather than target
foo.compare(bar[start..<end])
// apply range on both
foo[start..<end].compare(bar[start1..<end1])
// compare two strings ignoring first character
foo.dropFirst().compare(bar.dropFirst())
```

In all three cases, an explicit range argument need not appear on the `compare`
method itself. The implementation of `compare` does not need to know anything
about ranges. Methods need only take range arguments when that was an
integral part of their purpose (for example, setting the start and end of a
user's current selection in a text box).
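These combinations already work with today's generic `Collection` algorithms, as a runnable approximation of the hypothetical `compare` above:

```swift
let foo = "swift strings"
let bar = "swift sorting"

// Equality over slices, with no range parameters anywhere:
let sameStart = foo.prefix(6).elementsEqual(bar.prefix(6))     // true: "swift "
let sameTail = foo.dropFirst().elementsEqual(bar.dropFirst())  // false
```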

#### Different type, shared storage

The desire to share underlying storage while preventing accidental memory leaks
occurs with slices of `Array`. For this reason we have an `ArraySlice` type.
The inconvenience of a separate type is mitigated by most operations used on
`Array` from the standard library being generic over `Sequence` or `Collection`.

We should apply the same approach to `String` by introducing a distinct
`SubSequence` type, `Substring`. The advice given for `ArraySlice` would apply equally to `Substring`:

> **Important:** Long-term storage of `Substring` instances is discouraged. A
> substring holds a reference to the entire storage of a larger string, not
> just to the portion it presents, even after the original string's lifetime
> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
> of large strings that are no longer otherwise accessible, which can appear
> to be memory leakage.

When assigning a `Substring` to a longer-lived variable (usually a stored
property) explicitly of type `String`, a type conversion will be performed, and
at this point the substring buffer is copied and the original string's storage
can be released.
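A minimal sketch of that conversion point. The implicit `Substring`-to-`String` conversion proposed below did not exist at the time of writing, so the copy is spelled explicitly here:

```swift
// `sub` is a Substring sharing storage with the whole of `big`:
let big = String(repeating: "x", count: 10_000) + ",tail"
let sub = big.split(separator: ",")[1]   // "tail", but keeps big's buffer alive

// Converting to String copies just the four characters, after which
// big's large buffer can be released:
let kept = String(sub)
```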

A `String` that was not its own `Substring` could be one word—a single tagged
pointer—without requiring additional allocations. `Substring`s would be a view
onto a `String`, so are three words: a pointer to the owner, a pointer to the start, and a
length. The small string optimization for `Substring` would take advantage of
the larger size, probably with a less compressed encoding for speed.

The downside of having two types is the inconvenience of sometimes having a
`Substring` when you need a `String`, and vice-versa. It is likely this would
be a significantly bigger problem than with `Array` and `ArraySlice`, as
slicing of `String` is such a common operation. It is especially relevant to
existing code that assumes `String` is the currency type. To ease the pain of
type mismatches, `Substring` should be a subtype of `String` in the same way
that `Int` is a subtype of `Optional<Int>`. This would give users an implicit
conversion from `Substring` to `String`, as well as the usual implicit
conversions such as `[Substring]` to `[String]` that other subtype
relationships receive.

In most cases, type inference combined with the subtype relationship should
make the type difference a non-issue and users will not care which type they
are using. For flexibility and optimizability, most operations from the
standard library will traffic in generic models of
[`Unicode`](#the--code-unicode--code--protocol).

##### Guidance for API Designers

In this model, **if a user is unsure about which type to use, `String` is always
a reasonable default**. A `Substring` passed where `String` is expected will be
implicitly copied. When compared to the “same type, copied storage” model, we
have effectively deferred the cost of copying from the point where a substring
is created until it must be converted to `String` for use with an API.

A user who needs to optimize away copies altogether should use this guideline:
if for performance reasons you are tempted to add a `Range` argument to your
method as well as a `String` to avoid unnecessary copies, you should instead
use `Substring`.
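For example, a sketch of that guideline in practice. The generic constraint here is spelled `StringProtocol`, the name this role eventually took in Swift 4 (this document calls the protocol `Unicode`):

```swift
// Instead of vowelCount(_:in: Range<String.Index>), accept any string-like
// argument, so callers can pass a Substring and avoid copies:
func vowelCount<S: StringProtocol>(_ s: S) -> Int {
  return s.filter { "aeiou".contains($0) }.count
}

let whole = vowelCount("education")                // 5, called on a String
let sliced = vowelCount("education".dropFirst(3))  // 3, called on a Substring
```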

##### The “Empty Subscript”

To make it easy to call such an optimized API when you only have a `String` (or
to call any API that takes a `Collection`'s `SubSequence` when all you have is
the `Collection`), we propose the following “empty subscript” operation,

```swift
extension Collection {
  subscript() -> SubSequence {
    return self[startIndex..<endIndex]
  }
}
```

which allows the following usage:

```swift
funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring
```

The `[]` syntax can be offered as a fixit when needed, similar to `&` for an
`inout` argument. While it doesn't help a user to convert `[String]` to
`[Substring]`, the need for such conversions is extremely rare, can be done with
a simple `map` (which could also be offered by a fixit):

```swift
takesAnArrayOfSubstring(arrayOfString.map { $0[] })
```

#### Other Options Considered

As we have seen, all three options above have downsides, but it's possible
these downsides could be eliminated/mitigated by the compiler. We are proposing
one such mitigation—implicit conversion—as part of the "different type,
shared storage" option, to help avoid the cognitive load on developers of
having to deal with a separate `Substring` type.

To avoid the memory leak issues of a "same type, shared storage" substring
option, we considered whether the compiler could perform an implicit copy of
the underlying storage when it detects the string is being "stored" for long
term usage, say when it is assigned to a stored property. The trouble with this
approach is it is very difficult for the compiler to distinguish between
long-term storage versus short-term in the case of abstractions that rely on
stored properties. For example, should the storing of a substring inside an
`Optional` be considered long-term? Or the storing of multiple substrings
inside an array? The latter would not work well in the case of a
`components(separatedBy:)` implementation that intended to return an array of
substrings. It would also be difficult to distinguish intentional medium-term
storage of substrings, say by a lexer. There does not appear to be an effective
consistent rule that could be applied in the general case for detecting when a
substring is truly being stored long-term.

To avoid the cost of copying substrings under "same type, copied storage", the
optimizer could be enhanced to reduce the impact of some of those copies.
For example, this code could be optimized to pull the invariant substring out
of the loop:

```swift
for _ in 0..<lots {
  someFunc(takingString: bigString[bigRange])
}
```

It's worth noting that a similar optimization is needed to avoid an equivalent
problem with implicit conversion in the "different type, shared storage" case:

```swift
let substring = bigString[bigRange]
for _ in 0..<lots { someFunc(takingString: substring) }
```

However, in the case of "same type, copied storage" there are many use cases
that cannot be optimized as easily. Consider the following simple definition of
a recursive `contains` algorithm which, when substring slicing costs linear
time, makes the overall algorithm quadratic:

```swift
extension String {
    func containsChar(_ x: Character) -> Bool {
        return !isEmpty && (first == x || dropFirst().containsChar(x))
    }
}
```

Expecting the optimizer to eliminate this problem is unrealistic. Instead, the
user must remember to rewrite the code to avoid string slicing if they want it
to be efficient:

```swift
extension String {
    // add optional argument tracking progress through the string
    func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
        let idx = idx ?? startIndex
        return idx != endIndex
            && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
    }
}
```

#### Substrings, Ranges and Objective-C Interop

The pattern of passing a string/range pair is common in several Objective-C
APIs, and is made especially awkward in Swift by the non-interchangeability of
`Range<String.Index>` and `NSRange`.

```swift
s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))
```

In general, however, the Swift idiom for operating on a sub-range of a
`Collection` is to *slice* the collection and operate on that:

```swift
s2.find(s2[j..<s2.endIndex])
```

Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported
without the `NSRange` argument. The Objective-C importer should be changed to
give these APIs special treatment so that when a `Substring` is passed, instead
of being converted to a `String`, the full `NSString` and range are passed to
the Objective-C method, thereby avoiding a copy.

As a result, you would never need to pass an `NSRange` to these APIs, which
solves the impedance problem by eliminating the argument, resulting in more
idiomatic Swift code while retaining the performance benefit. To help users
manually handle any cases that remain, Foundation should be augmented to allow
the following syntax for converting to and from `NSRange`:

```swift
let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
let iToJ = Range(nsr, in: s)    // Equivalent to i..<j
```

### The `Unicode` protocol

With `Substring` and `String` being distinct types and sharing almost all
interface and semantics, and with the highest-performance string processing
requiring knowledge of encoding and layout that the currency types can't
provide, it becomes important to capture the common “string API” in a protocol.
Since Unicode conformance is a key feature of string processing in Swift, we
call that protocol `Unicode`:

**Note:** The following assumes several features that are planned but not yet implemented in
Swift, and should be considered a sketch rather than a final design.

```swift
protocol Unicode
  : Comparable, BidirectionalCollection where Element == Character {

  associatedtype Encoding : UnicodeEncoding
  var encoding: Encoding { get }

  associatedtype CodeUnits
    : RandomAccessCollection where Element == Encoding.CodeUnit
  var codeUnits: CodeUnits { get }

  associatedtype UnicodeScalars
    : BidirectionalCollection where Element == UnicodeScalar
  var unicodeScalars: UnicodeScalars { get }

  associatedtype ExtendedASCII
    : BidirectionalCollection where Element == UInt32
  var extendedASCII: ExtendedASCII { get }
}

extension Unicode {
  // ... define high-level non-mutating string operations, e.g. search ...

  func compared<Other: Unicode>(
    to rhs: Other,
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> SortOrder { ... }
}

extension Unicode : RangeReplaceableCollection where CodeUnits :
  RangeReplaceableCollection {
    // Satisfy protocol requirement
    mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
      where C.Element == Element

  // ... define high-level mutating string operations, e.g. replace ...
}
```

The goal is that `Unicode` exposes the underlying encoding and code units in
such a way that for types with a known representation (e.g. a high-performance
`UTF8String`) that information can be known at compile-time and can be used to
generate a single path, while still allowing types like `String` that admit
multiple representations to use runtime queries and branches to fast path
specializations.

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.

### Scanning, Matching, and Tokenization

#### Low-Level Textual Analysis

We should provide convenient APIs for processing strings by character. For example,
it should be easy to cleanly express, “if this string starts with `"f"`, process
the rest of the string as follows…” Swift is well-suited to expressing this
common pattern beautifully, but we need to add the APIs. Here are two examples
of the sort of code that might be possible given such APIs:

```swift
if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
  somethingWith(input) // process the rest of input
}

if let (number, restOfInput) = input.parsingPrefix(Int.self) {
  ...
}
```

The specific spelling and functionality of APIs like this are TBD. The larger
point is to make sure matching-and-consuming jobs are well-supported.
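As a sketch of the shape such APIs might take, here `droppingPrefix` is implemented against today's stdlib; the name and signature are assumptions drawn from the example above, not shipping API:

```swift
extension Substring {
  // Returns the rest of the string if the first character matches the
  // predicate, or nil if it doesn't (or the string is empty).
  func droppingPrefix(_ matches: (Character) -> Bool) -> Substring? {
    guard let c = first, matches(c) else { return nil }
    return dropFirst()
  }
}

let rest = "f123"[...].droppingPrefix { $0.isLetter }   // Optional("123")
let none = "123"[...].droppingPrefix { $0.isLetter }    // nil
```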

#### Unified Pattern Matcher Protocol

Many of the current methods that do matching are overloaded to do the same
logical operations in different ways, with the following axes:

- Logical Operation: `find`, `split`, `replace`, match at start
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
the method name, and sometimes an argument
- Whole string or subrange.

We should represent these aspects as orthogonal, composable components,
abstracting pattern matchers into a protocol like
[this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
allowing us to define logical operations once, without introducing overloads,
and massively reducing API surface area.

For example, using the strawman prefix `%` syntax to turn string literals into
patterns, the following pairs would all invoke the same generic methods:

```swift
if let found = s.firstMatch(%"searchString") { ... }
if let found = s.firstMatch(someRegex) { ... }

for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
for m in s.allMatches(someRegex) { ... }

let items = s.split(separatedBy: ", ")
let tokens = s.split(separatedBy: CharacterSet.whitespace)
```

Note that, because Swift requires the indices of a slice to match the indices of
the range from which it was sliced, operations like `firstMatch` can return a
`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in
the string being searched, if needed, can easily be recovered as the
`startIndex` and `endIndex` of the `Substring`.
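A runnable sketch of this index-recovery property, using Foundation's `range(of:)` in place of the proposed `firstMatch`:

```swift
import Foundation

let s = "one two three"
let match = s[s.range(of: "two")!]    // a Substring, not a bare Range

// The match's position in `s` falls out of the shared indices:
let offset = s.distance(from: s.startIndex, to: match.startIndex)  // 4
```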

Note also that matching operations are useful for collections in general, and
would fall out of this proposal:

```swift
// replace subsequences of contiguous NaNs with zero
forces.replace(oneOrMore([Float.nan]), [0.0])
```

#### Regular Expressions

Addressing regular expressions is out of scope for this proposal.
That said, it is important to note that the pattern matching protocol mentioned
above provides a suitable foundation for regular expressions, and types such as
`NSRegularExpression` can easily be retrofitted to conform to it. In the
future, support for regular expression literals in the compiler could allow for
compile-time syntax checking and optimization.

### String Indices

`String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and
`utf16`—each with its own opaque index type. The APIs used to translate indices
between views add needless complexity, and the opacity of indices makes them
difficult to serialize.

The index translation problem has two aspects:

1. `String` views cannot consume one another's indices without a cumbersome
   conversion step. An index into a `String`'s `characters` must be translated
   before it can be used as a position in its `unicodeScalars`. Although these
   translations are rarely needed, they add conceptual and API complexity.
2. Many APIs in the core libraries and other frameworks still expose `String`
   positions as `Int`s and regions as `NSRange`s, which can only reference a
   `utf16` view and interoperate poorly with `String` itself.

#### Index Interchange Among Views

String's need for flexible backing storage and reasonably-efficient indexing
(i.e. without dynamically allocating and reference-counting the indices
themselves) means indices need an efficient underlying storage type. Although
we do not wish to expose `String`'s indices *as* integers, `Int` offsets into
underlying code unit storage makes a good underlying storage type, provided
`String`'s underlying storage supports random-access. We think random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Making these `Int` code unit offsets conveniently accessible and constructible
solves the serialization problem:

```swift
clipboard.write(s.endIndex.codeUnitOffset)
let offset = clipboard.read(Int.self)
let i = String.Index(codeUnitOffset: offset)
```

Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
seamless by having them share an index type (semantics of indexing a `String`
between grapheme cluster boundaries are TBD—it can either trap or be forgiving).
Having a common index allows easy traversal into the interior of graphemes,
something that is often needed, without making it likely that someone will do it
by accident.

- `String.index(after:)` should advance to the next grapheme, even when the
  index points partway through a grapheme.

- `String.index(before:)` should move to the start of the grapheme before
  the current position.

Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not
crucial, as the specifics of encoding should not be a concern for most use
cases, and would impose needless costs on the indices of other views. That
said, we can make translation much more straightforward by exposing simple
bidirectional converting `init`s on both index types:

```swift
let u8Position = String.UTF8.Index(someStringIndex)
let originalPosition = String.Index(u8Position)
```

#### Index Interchange with Cocoa

We intend to address `NSRange`s that denote substrings in Cocoa APIs as
described [later in this document](#substrings--ranges-and-objective-c-interop).
That leaves the interchange of bare indices with Cocoa APIs trafficking in
`Int`. Hopefully such APIs will be rare, but when needed, the following
extension, which would be useful for all `Collections`, can help:

```swift
extension Collection {
  func index(offset: IndexDistance) -> Index {
    return index(startIndex, offsetBy: offset)
  }
  func offset(of i: Index) -> IndexDistance {
    return distance(from: startIndex, to: i)
  }
}
```

Then integers can easily be translated into offsets into a `String`'s `utf16`
view for consumption by Cocoa:

```swift
let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
let swiftIndex = s.utf16.index(offset: cocoaIndex)
```

### Formatting

A full treatment of formatting is out of scope of this proposal, but
we believe it's crucial for completing the text processing picture. This
section details some of the existing issues and thinking that may guide future
development.

#### Printf-Style Formatting

`String.format` is designed on the `printf` model: it takes a format string with
textual placeholders for substitution, and an arbitrary list of other arguments.
The syntax and meaning of these placeholders has a long history in
C, but for anyone who doesn't use them regularly they are cryptic and complex,
as the `printf(3)` man page attests.

Aside from complexity, this style of API has two major problems: First, the
spelling of these placeholders must match up to the types of the arguments, in
the right order, or the behavior is undefined. Some limited support for
compile-time checking of this correspondence could be implemented, but only for
the cases where the format string is a literal. Second, there's no reasonable
way to extend the formatting vocabulary to cover the needs of new types: you are
stuck with what's in the box.

#### Foundation Formatters

The formatters supplied by Foundation are highly capable and versatile, offering
both formatting and parsing services. When used for formatting, though, the
design pattern demands more from users than it should:

* Matching the type of data being formatted to a formatter type
* Creating an instance of that type
* Setting stateful options (`currency`, `dateStyle`) on the type. Note: the
   need for this step prevents the instance from being used and discarded in
   the same expression where it is created.
* Overall, introduction of needless verbosity into source

These may seem like small issues, but the experience of Apple localization
experts is that the total drag of these factors on programmers is such that they
tend to reach for `String.format` instead.

#### String Interpolation

Swift string interpolation provides a user-friendly alternative to printf's
domain-specific language (just write ordinary Swift code!) and its type safety
problems (put the data right where it belongs!) but the following issues prevent
it from being useful for localized formatting (among other jobs):

* [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
   types used in string interpolation.
* [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
   distinguish (fragments of) the base string from the string substitutions.

In the long run, we should improve Swift string interpolation to the point where
it can participate in most any formatting job. Mostly this centers around
fixing the interpolation protocols per the previous item, and supporting
localization.

To be able to use formatting effectively inside interpolations, it needs to be
both lightweight (because it all happens in-situ) and discoverable. One
approach would be to standardize on `format` methods, e.g.:

```swift
"Column 1: \(n.format(radix:16, width:8)) *** \(message)"

"Something with leading zeroes: \(x.format(fill: zero, width:8))"
```
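The `format` methods above are hypothetical. For comparison, the nearest stdlib spelling today combines `String`'s radix initializer with manual padding:

```swift
let n = 65535
let hex = String(n, radix: 16)        // "ffff"
// Zero-fill to width 8 by hand, which the sketched format(fill:width:)
// call would make declarative:
let padded = String(repeating: "0", count: max(0, 8 - hex.count)) + hex
```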

### C String Interop

Our support for interoperation with nul-terminated C strings is scattered and
incoherent, with six ways to transform a C string into a `String` and four ways
to do the inverse. These APIs should be replaced with the following:

```swift
extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  ///   bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  ///   the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}
```

In both of the construction APIs, any invalid encoding sequence detected will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per Unicode specification. This covers the common case. The
replacement is done *physically* in the underlying storage and the validity of
the result is recorded in the `String`'s `encoding` so that future accesses
need not be slowed down by the possibility of separate error repair.
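Today's `String.init(cString:)` already behaves this way for UTF-8, which makes the repair observable in a short sketch:

```swift
// "a", an ill-formed UTF-8 byte, "b", then the terminating NUL:
let bytes: [CChar] = [0x61, CChar(bitPattern: 0xFF), 0x62, 0]
let repaired = bytes.withUnsafeBufferPointer {
  String(cString: $0.baseAddress!)    // the invalid byte becomes U+FFFD
}
// repaired == "a\u{FFFD}b"
```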

Construction that is aborted when encoding errors are detected can be
accomplished using APIs on the `encoding`. String types that retain their
physical encoding even in the presence of errors and are repaired on-the-fly can
be built as different instances of the `Unicode` protocol.

### Unicode 9 Conformance

Unicode 9 (and macOS 10.11) brought us support for family emoji, which changes
the process of properly identifying `Character` boundaries. We need to update
`String` to account for this change.

### High-Performance String Processing

Many strings are short enough to store in 64 bits, many can be stored using only
8 bits per unicode scalar, others are best encoded in UTF-16, and some come to
us already in some other encoding, such as UTF-8, that would be costly to
translate. Supporting these formats while maintaining usability for
general-purpose APIs demands that a single `String` type can be backed by many
different representations.

That said, the highest performance code always requires static knowledge of the
data structures on which it operates, and for this code, dynamic selection of
representation comes at too high a cost. Heavy-duty text processing demands a
way to opt out of dynamism and directly use known encodings. Having this
ability can also make it easy to cleanly specialize code that handles dynamic
cases for maximal efficiency on the most common representations.

To address this need, we can build models of the `Unicode` protocol that encode
representation information into the type, such as `NFCNormalizedUTF16String`.

### Parsing ASCII Structure

Although many machine-readable formats support the inclusion of arbitrary
Unicode text, it is also common that their fundamental structure lies entirely
within the ASCII subset (JSON, YAML, many XML formats). These formats are often
processed most efficiently by recognizing ASCII structural elements as ASCII,
and capturing the arbitrary sections between them in more-general strings. The
current String API offers no way to efficiently recognize ASCII and skip past
everything else without the overhead of full decoding into unicode scalars.

For these purposes, strings should supply an `extendedASCII` view that is a
collection of `UInt32`, where values less than `0x80` represent the
corresponding ASCII character, and other values represent data that is specific
to the underlying encoding of the string.
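The same trick can be approximated today over the `utf8` view (code units are `UInt8` rather than the proposed `UInt32`): in UTF-8, every byte of a non-ASCII scalar is >= 0x80, so ASCII structural characters can be matched byte-wise without decoding:

```swift
let json = "[1,\"α\",333]"

// Count structural commas without ever decoding the Greek letter:
let commas = json.utf8.filter { $0 == UInt8(ascii: ",") }.count   // 2
```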

## Language Support

This proposal depends on two new features in the Swift language:

1. **Generic subscripts**, to
  enable unified slicing syntax.

2. **A subtype relationship** between
  `Substring` and `String`, enabling framework APIs to traffic solely in
  `String` while still making it possible to avoid copies by handling
  `Substring`s where necessary.

Additionally, **the ability to nest types and protocols inside
protocols** could significantly shrink the footprint of this proposal
on the top-level Swift namespace.

## Open Questions

### Must `String` be limited to storing UTF-16 subset encodings?

- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
question here; this is about what encodings must be storable, without
transcoding, in the common currency type called “`String`”.
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
- If we have a way to get at a `String`'s code units, we need a concrete type in
which to express them in the API of `String`, which is a concrete type
- If String needs to be able to represent UTF-32, presumably the code units need
to be `UInt32`.
- Not supporting UTF-32-encoded text seems like one reasonable design choice.
- Maybe we can allow UTF-8 storage in `String` and expose its code units as
`UInt16`, just as we would for Latin-1.
- Supporting only UTF-16-subset encodings would imply that `String` indices can
be serialized without recording the `String`'s underlying encoding.

### Do we need a type-erasable base protocol for UnicodeEncoding?

UnicodeEncoding has an associated type, but it may be important to be able to
traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”
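One conventional answer is the type-erasure pattern: a wrapper that captures only the encoding properties dynamic clients need. All names below are illustrative, not proposed API:

```swift
// A protocol with an associated type cannot be used as a plain value type,
// so dynamic clients need an erased wrapper.
protocol Encoding {
    associatedtype CodeUnit
    static var name: String { get }
}

enum UTF8Encoding: Encoding { typealias CodeUnit = UInt8; static let name = "UTF-8" }
enum UTF16Encoding: Encoding { typealias CodeUnit = UInt16; static let name = "UTF-16" }

// The erased type carries only what a query like "tell me the most
// efficient encoding for this string" needs to report.
struct AnyEncoding {
    let name: String
    let codeUnitBitWidth: Int
    init<E: Encoding>(_: E.Type) where E.CodeUnit: FixedWidthInteger {
        name = E.name
        codeUnitBitWidth = E.CodeUnit.bitWidth
    }
}

let dynamic: [AnyEncoding] = [AnyEncoding(UTF8Encoding.self),
                              AnyEncoding(UTF16Encoding.self)]
```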

### Should there be a string “facade?”

One possible design alternative makes `Unicode` a vehicle for expressing
the storage and encoding of code units, but does not attempt to give it an API
appropriate for `String`. Instead, string APIs would be provided by a generic
wrapper around an instance of `Unicode`:

```swift
struct StringFacade<U: Unicode> : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias String = StringFacade<StringStorage>
typealias Substring = StringFacade<StringStorage.SubSequence>
```

This design would allow us to de-emphasize lower-level `String` APIs such as
access to the specific encoding, by putting them behind a `.unicode` property.
A similar effect in a facade-less design would require a new top-level
`StringProtocol` playing the role of the facade with an `associatedtype
Storage : Unicode`.

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

```swift
struct String<U: Unicode = StringStorage>
  : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>
```

One advantage of such a design is that naïve users will always extend “the right
type” (`String`) without thinking, and the new APIs will show up on `Substring`,
`MyUTF8String`, etc. That said, it also has downsides that should not be
overlooked, not least of which is the confusability of the meaning of the word
“string.” Is it referring to the generic or the concrete type?

### `TextOutputStream` and `TextOutputStreamable`

`TextOutputStreamable` is intended to provide a vehicle for
efficiently transporting formatted representations to an output stream
without forcing the allocation of storage. Its use of `String`, a
type with multiple representations, at the lowest-level unit of
communication, conflicts with this goal. It might be sufficient to
change `TextOutputStream` and `TextOutputStreamable` to traffic in an
associated type conforming to `Unicode`, but that is not yet clear.
This area will require some design work.
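For reference, the protocols as they stand today: a streamable value writes itself into any stream without allocating an intermediate buffer, but the unit of communication is `String`, which is exactly the tension described above. This sketch uses only shipping API:

```swift
// A stream that just counts UTF-8 bytes, never accumulating storage.
struct CountingStream: TextOutputStream {
    var count = 0
    mutating func write(_ string: String) { count += string.utf8.count }
}

// A streamable value formats itself directly into the target stream.
struct Point: TextOutputStreamable {
    var x, y: Int
    func write<Target: TextOutputStream>(to target: inout Target) {
        target.write("(\(x), \(y))")
    }
}

var stream = CountingStream()
Point(x: 3, y: 4).write(to: &stream)   // streams "(3, 4)"
```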

### `description` and `debugDescription`

* Should these be creating localized or non-localized representations?
* Is returning a `String` efficient enough?
* Is `debugDescription` pulling the weight of the API surface area it adds?

### `StaticString`

`StaticString` was added as a byproduct of standard library development and kept
around because it seemed useful, but it was never truly *designed* for client
programmers. We need to decide what happens with it. Presumably *something*
should fill its role, and that should conform to `Unicode`.

## Footnotes

<b id="f0">0</b> The integers rewrite currently underway is expected to
   substantially reduce the scope of `Int`'s API by using more
   generics. [:leftwards_arrow_with_hook:](#a0)

<b id="f1">1</b> In practice, these semantics will usually be tied to the
version of the installed [ICU](http://icu-project.org) library, which
programmatically encodes the most complex rules of the Unicode Standard and its
de-facto extension, CLDR.[:leftwards_arrow_with_hook:](#a1)

<b id="f2">2</b>
See
[http://unicode.org/reports/tr29/#Notation](http://unicode.org/reports/tr29/#Notation). Note
that inserting Unicode scalar values to prevent merging of grapheme clusters would
also constitute a kind of misbehavior (one of the clusters at the boundary would
not be found in the result), so would be relatively costly to implement, with
little benefit. [:leftwards_arrow_with_hook:](#a2)

<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by
the Unicode standard for this purpose. In fact there's
a [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
dedicated to it. In particular, §5.17 says:

> When comparing text that is visible to end users, a correct linguistic sort
> should be used, as described in _Section 5.16, Sorting and
> Searching_. However, in many circumstances the only requirement is for a
> fast, well-defined ordering. In such cases, a binary ordering can be used.

[:leftwards_arrow_with_hook:](#a4)

<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto
properties in a table that's indexed by unicode scalar value. This table is
part of the Unicode standard. Some of these queries (e.g., “is this an
uppercase character?”) may have fairly obvious generalizations to grapheme
clusters, but exactly how to do it is a research topic and *ideally* we'd either
establish the existing practice that the Unicode committee would standardize, or
the Unicode committee would do the research and we'd implement their
result.[:leftwards_arrow_with_hook:](#a5)

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution



(Shawn Erickson) #7

Huge thanks for putting this out... a lot to read and digest but great to
see.

···

On Thu, Jan 19, 2017 at 6:56 PM Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

Hi all,

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

We’re eager to hear everyone’s thoughts.

Regards,
Ben and Dave

# String Processing For Swift 4

* Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](
https://github.com/airspeedswift)

The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus
far, with just this short blurb in the
[list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):

> **String re-evaluation**: String is one of the most important fundamental
> types in the language. The standard library leads have numerous ideas of how
> to improve the programming model for it, without jeopardizing the goals of
> providing a unicode-correct-by-default model. Our goal is to be better at
> string processing than Perl!

For Swift 4 and beyond we want to improve three dimensions of text
processing:

  1. Ergonomics
  2. Correctness
  3. Performance

This document is meant to both provide a sense of the long-term vision
(including undecided issues and possible approaches), and to define the
scope of
work that could be done in the Swift 4 timeframe.

## General Principles

### Ergonomics

It's worth noting that ergonomics and correctness are mutually-reinforcing. An
API that is easy to use—but incorrectly—cannot be considered an ergonomic
success. Conversely, an API that's simply hard to use is also hard to use
correctly. Achieving optimal performance without compromising ergonomics or
correctness is a greater challenge.

Consistency with the Swift language and idioms is also important for
ergonomics. There are several places both in the standard library and in
the
foundation additions to `String` where patterns and practices found
elsewhere
could be applied to improve usability and familiarity.

### API Surface Area

Primary data types such as `String` should have APIs that are easily
understood
given a signature and a one-line summary. Today, `String` fails that
test. As
you can see, the Standard Library and Foundation both contribute
significantly to
its overall complexity.

**Method Arity** | **Standard Library** | **Foundation**
---|:---:|:---:
0: `ƒ()` | 5 | 7
1: `ƒ(:)` | 19 | 48
2: `ƒ(::)` | 13 | 19
3: `ƒ(:::)` | 5 | 11
4: `ƒ(::::)` | 1 | 7
5: `ƒ(:::::)` | - | 2
6: `ƒ(::::::)` | - | 1

**API Kind** | **Standard Library** | **Foundation**
---|:---:|:---:
`init` | 41 | 18
`func` | 42 | 55
`subscript` | 9 | 0
`var` | 26 | 14

**Total: 205 APIs**

By contrast, `Int` has 80 APIs, none with more than two parameters.[0]
String processing is complex enough; users shouldn't have
to press through physical API sprawl just to get started.

Many of the choices detailed below contribute to solving this problem,
including:

  * Restoring `Collection` conformance and dropping the `.characters` view.
  * Providing a more general, composable slicing syntax.
  * Altering `Comparable` so that parameterized
    (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
  * Clearly separating language-dependent operations on text produced
    by and for humans from language-independent
    operations on text produced by and for machine processing.
  * Relocating APIs that fall outside the domain of basic string
processing and
    discouraging the proliferation of ad-hoc extensions.

### Batteries Included

While `String` is available to all programs out-of-the-box, crucial APIs for
basic string processing tasks are still inaccessible until `Foundation` is
imported. While it makes sense that `Foundation` is needed for domain-specific
jobs such as
[linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
one should not need to import anything to, for example, do case-insensitive
comparison.

### Unicode Compliance and Platform Support

The Unicode standard provides a crucial objective reference point for what
constitutes correct behavior in an extremely complex domain, so
Unicode-correctness is, and will remain, a fundamental design principle
behind
Swift's `String`. That said, the Unicode standard is an evolving
document, so
this objective reference-point is not fixed.[1] While
many of the most important operations—e.g. string hashing, equality, and
non-localized comparison—will be stable, the semantics
of others, such as grapheme breaking and localized comparison and case
conversion, are expected to change as platforms are updated, so programs
should
be written so their correctness does not depend on precise stability of
these
semantics across OS versions or platforms. Although it may be possible to
imagine static and/or dynamic analysis tools that will help users find such
errors, the only sure way to deal with this fact of life is to educate
users.

## Design Points

### Internationalization

There is strong evidence that developers cannot determine how to use
internationalization APIs correctly. Although documentation could and
should be
improved, the sheer size, complexity, and diversity of these APIs is a
major
contributor to the problem, causing novices to tune out, and more
experienced
programmers to make avoidable mistakes.

The first step in improving this situation is to regularize all localized
operations as invocations of normal string operations with extra
parameters. Among other things, this means:

1. Doing away with `localizedXXX` methods
2. Providing a terse way to name the current locale as a parameter
3. Automatically adjusting defaults for options such
   as case sensitivity based on whether the operation is localized.
4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
   guidance in the
   [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).

Along with appropriate documentation updates, these changes will make
localized
operations more teachable, comprehensible, and approachable, thereby
lowering a
barrier that currently leads some developers to ignore localization issues
altogether.

#### The Default Behavior of `String`

Although this isn't well-known, the most accessible form of many
operations on
Swift `String` (and `NSString`) are really only appropriate for text that
is
intended to be processed for, and consumed by, machines. The semantics of
the
operations with the simplest spellings are always non-localized and
language-agnostic.

Two major factors play into this design choice:

1. Machine processing of text is important, so we should have first-class,
   accessible functions appropriate to that use case.

2. The most general localized operations require a locale parameter not
required
   by their un-localized counterparts. This naturally skews complexity
towards
   localized operations.

Reaffirming that `String`'s simplest APIs have
language-independent/machine-processed semantics has the benefit of
clarifying
the proper default behavior of operations such as comparison, and allows
us to
make [significant optimizations](#collation-semantics) that were previously
thought to conflict with Unicode.

#### Future Directions

One of the most common internationalization errors is the unintentional
presentation to users of text that has not been localized, but
regularizing APIs
and improving documentation can go only so far in preventing this error.
Combined with the fact that `String` operations are non-localized by
default,
the environment for processing human-readable text may still be somewhat
error-prone in Swift 4.

For an audience of mostly non-experts, it is especially important that
naïve
code is very likely to be correct if it compiles, and that more
sophisticated
issues can be revealed progressively. For this reason, we intend to
specifically and separately target localization and internationalization
problems in the Swift 5 timeframe.

### Operations With Options

There are three categories of common string operations that need to be tuned
in various dimensions:

**Operation**|**Applicable Options**
---|---
sort ordering | locale, case/diacritic/width-insensitivity
case conversion | locale
pattern matching | locale, case/diacritic/width-insensitivity

The defaults for case-, diacritic-, and width-insensitivity are different
for
localized operations than for non-localized operations, so for example a
localized sort should be case-insensitive by default, and a non-localized
sort
should be case-sensitive by default. We propose a standard “language” of
defaulted parameters to be used for these purposes, with usage roughly
like this:

```swift
x.compared(to: y, case: .sensitive, in: swissGerman)

x.lowercased(in: .currentLocale)

x.allMatches(
  somePattern, case: .insensitive, diacritic: .insensitive)
```

This usage might be supported by code like this:

```swift
enum StringSensitivity {
  case sensitive
  case insensitive
}

extension Locale {
  static var currentLocale: Locale { ... }
}

extension Unicode {
  // An example of the option language in declaration context,
  // with nil defaults indicating unspecified, so defaults can be
  // driven by the presence/absence of a specific Locale
  func frobnicated(
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> Self { ... }
}
```
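A sketch (not proposed API) of how those `nil` defaults might be resolved, following the rule above that localized operations default to insensitivity while non-localized ones default to sensitivity. The enum is repeated so the example is self-contained:

```swift
enum StringSensitivity { case sensitive, insensitive }

// An explicit argument always wins; otherwise the default is driven by
// whether a locale was supplied (i.e. whether the operation is localized).
func resolvedCase(
    _ explicit: StringSensitivity?, localized: Bool
) -> StringSensitivity {
    return explicit ?? (localized ? .insensitive : .sensitive)
}
```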

### Comparing and Hashing Strings

#### Collation Semantics

What Unicode says about collation—which is used in `<`, `==`, and hashing—
turns
out to be quite interesting, once you pick it apart. The full Unicode
Collation
Algorithm (UCA) works like this:

1. Fully normalize both strings
2. Convert each string to a sequence of numeric triples to form a collation key
3. “Flatten” the key by concatenating the sequence of first elements to the
   sequence of second elements to the sequence of third elements
4. Lexicographically compare the flattened keys
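A toy model of steps 2–4 (nothing like ICU's actual implementation, and the triples below are invented): map to triples, flatten level by level, then compare the flattened keys lexicographically:

```swift
typealias CollationElement = (primary: UInt16, secondary: UInt16, tertiary: UInt16)

// Step 3: all primaries, then all secondaries, then all tertiaries,
// so that any primary difference outweighs every lower-level difference.
func flattenedKey(_ elements: [CollationElement]) -> [UInt16] {
    return elements.map { $0.primary }
         + elements.map { $0.secondary }
         + elements.map { $0.tertiary }
}

// Invented toy table: 'a' < 'b' at the primary level; 'a' vs 'A' differ
// only at the tertiary (case) level.
let lowerA: [CollationElement] = [(1, 1, 1)]
let upperA: [CollationElement] = [(1, 1, 2)]
let lowerB: [CollationElement] = [(2, 1, 1)]

// Step 4: lexicographic comparison of the flattened keys.
let aBeforeB = flattenedKey(lowerA).lexicographicallyPrecedes(flattenedKey(lowerB))
let aBeforeUpperA = flattenedKey(lowerA).lexicographicallyPrecedes(flattenedKey(upperA))
```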

While step 1 can usually
be [done quickly](http://unicode.org/reports/tr15/#Description_Norm) and
incrementally, step 2 uses a collation table that maps matching
*sequences* of
unicode scalars in the normalized string to *sequences* of triples, which
get
accumulated into a collation key. Predictably, this is where the real
costs
lie.

*However*, there are some bright spots to this story. First, as it turns out,
string sorting (localized or not) should be done down to what's called
the [“identical” level](http://unicode.org/reports/tr10/#Multi_Level_Comparison),
which adds a step 3a: append the string's normalized form to the flattened
collation key. At first blush this just adds work, but consider what it does
for equality: two strings that normalize the same, naturally, will collate the
same. But also, *strings that normalize differently will always collate
differently*. In other words, for equality, it is sufficient to compare the
strings' normalized forms and see if they are the same. We can therefore
entirely skip the expensive part of collation for equality comparison.
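Swift's `String` equality can be observed doing exactly this today: canonically equivalent strings compare equal even though their underlying scalar sequences differ, with no collation key involved:

```swift
let precomposed = "caf\u{E9}"     // é as a single scalar (U+00E9)
let decomposed  = "cafe\u{301}"   // e + COMBINING ACUTE ACCENT

// Equality is based on canonical equivalence (the normalized forms match)...
let equal = precomposed == decomposed

// ...even though the scalar sequences themselves differ.
let sameScalars =
    precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)
```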

Next, naturally, anything that applies to equality also applies to
hashing: it
is sufficient to hash the string's normalized form, bypassing collation
keys.
This should provide significant speedups over the current implementation.
Perhaps more importantly, since comparison down to the “identical” level
applies
even to localized strings, it means that hashing and equality can be
implemented
exactly the same way for localized and non-localized text, and hash tables
with
localized keys will remain valid across current-locale changes.

Finally, once it is agreed that the *default* role for `String` is to
handle
machine-generated and machine-readable text, the default ordering of
`String`s
need no longer use the UCA at all. It is sufficient to order them in any
way
that's consistent with equality, so `String` ordering can simply be a
lexicographical comparison of normalized forms,[4]
(which is equivalent to lexicographically comparing the sequences of
grapheme
clusters), again bypassing step 2 and offering another speedup.

This leaves us executing the full UCA *only* for localized sorting, and
ICU's
implementation has apparently been very well optimized.

Following this scheme everywhere would also allow us to make sorting
behavior
consistent across platforms. Currently, we sort `String` according to the
UCA,
except that—*only on Apple platforms*—pairs of ASCII characters are
ordered by
unicode scalar value.

#### Syntax

Because the current `Comparable` protocol expresses all comparisons with binary
operators, string comparisons—which may require additional
[options](#operations-with-options)—do not fit smoothly into the existing
syntax. At the same time, we'd like to solve other problems with comparison, as
outlined in
[this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
(implemented by changes at the head of
[this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
We should adopt a modification of that proposal that uses a method rather than
an operator `<=>`:

```swift
enum SortOrder { case before, same, after }

protocol Comparable : Equatable {
  func compared(to: Self) -> SortOrder
  ...
}
```

This change will give us a syntactic platform on which to implement
methods with
additional, defaulted arguments, thereby unifying and regularizing
comparison
across the library.

```swift
extension String {
  func compared(to: Self) -> SortOrder
}
```
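A self-contained sketch of the proposed shape, with a default three-way implementation written in terms of today's `<` and `==`. `SortOrder` here models the proposal; it is not the standard library's `Comparable`:

```swift
enum SortOrder { case before, same, after }

extension Comparable {
    // A three-way method gives defaulted arguments (locale, sensitivity)
    // somewhere to hang, which a bare `<` operator cannot offer.
    func compared(to other: Self) -> SortOrder {
        if self < other { return .before }
        if self == other { return .same }
        return .after
    }
}

let order = "apple".compared(to: "banana")
```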

**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible
that the standard library simply adopts Foundation's `ComparisonResult` as is,
but we believe the community should at least consider alternate naming before
that happens. There will be an opportunity to discuss the choices in detail
when the modified
[Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
comes up for review.

### `String` should be a `Collection` of `Character`s Again

In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
convinced ourselves that its semantics differed from those of `Collection`
too
significantly.

It was always well understood that if strings were treated as sequences of
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`,
`elementsEqual`,
and `reversed` would produce nonsense results. Thus, in Swift 1.0,
`String` was
a collection of `Character` (extended grapheme clusters). During 2.0
development, though, we realized that correct string concatenation could
occasionally merge distinct grapheme clusters at the start and end of
combined
strings.
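The merging behavior is easy to observe under the model this section argues for (and which Swift 4 ultimately shipped): appending a combining mark to a string ending in a base character does not grow the character count, because the mark merges into the existing cluster.

```swift
var s = "cafe"
let before = s.count   // 4 graphemes

s += "\u{301}"         // append COMBINING ACUTE ACCENT

let after = s.count    // still 4: the accent merged into the final "e"
```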

This quirk aside, every aspect of strings-as-collections-of-graphemes appears to
comport perfectly with Unicode. We think the concatenation problem is tolerable,
because the cases where it occurs all represent partially-formed constructs. The
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE
ACCENT)—are explicitly called out in the Unicode standard as
“[degenerate](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)” or
“[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The other
cases—such as a string ending in a zero-width joiner or half of a regional
indicator—appear to be equally transient and unlikely outside of a text editor.

Admitting these cases encourages exploration of grapheme composition and is
consistent with what appears to be an overall Unicode philosophy that “no
special provisions are made to get marginally better behavior for… cases
that
never occur in practice.”[2] Furthermore, it seems
unlikely to disturb the semantics of any plausible algorithms. We can
handle
these cases by documenting them, explicitly stating that the elements of a
`String` are an emergent property based on Unicode rules.

The benefits of restoring `Collection` conformance are substantial:

  * Collection-like operations encourage experimentation with strings to
    investigate and understand their behavior. This is useful for teaching
new
    programmers, but also good for experienced programmers who want to
    understand more about strings/unicode.

  * Extended grapheme clusters form a natural element boundary for Unicode
    strings. For example, searching and matching operations will always
produce
    results that line up on grapheme cluster boundaries.

  * Character-by-character processing is a legitimate thing to do in many
real
    use-cases, including parsing, pattern matching, and language-specific
    transformations such as transliteration.

  * `Collection` conformance makes a wide variety of powerful operations
    available that are appropriate to `String`'s default role as the
vehicle for
    machine processed text.

    The methods `String` would inherit from `Collection` that are similar to
    higher-level string algorithms have the right semantics. For example,
    grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of
    `flatMap` with case-conversion produce the same results one would expect
    from whole-string ordering comparison, equality comparison, and
    case-conversion, respectively. `reverse` operates correctly on graphemes,
    keeping diacritics moored to their base characters and leaving emoji intact.
    Other methods such as `indexOf` and `contains` make obvious sense. A few
    `Collection` methods, like `min` and `max`, may not be particularly useful
    on `String`, but we don't consider that to be a problem worth solving, in
    the same way that we wouldn't try to suppress `min` and `max` on a
    `Set([UInt8])` that was used to store IP addresses.

  * Many of the higher-level operations that we want to provide for
`String`s,
    such as parsing and pattern matching, should apply to any
`Collection`, and
    many of the benefits we want for `Collections`, such
    as unified slicing, should accrue
    equally to `String`. Making `String` part of the same protocol
hierarchy
    allows us to write these operations once and not worry about keeping
the
    benefits in sync.

  * Slicing strings into substrings is a crucial part of the vocabulary of
    string processing, and all other sliceable things are `Collection`s.
    Because of its collection-like behavior, users naturally think of
`String`
    in collection terms, but run into frustrating limitations where it
fails to
    conform and are left to wonder where all the differences lie. Many
simply
    “correct” this limitation by declaring a trivial conformance:

    ```swift
    extension String : BidirectionalCollection {}
    ```

    Even if we removed indexing-by-element from `String`, users could
still do
    this:

    ```swift
      extension String : BidirectionalCollection {
        subscript(i: Index) -> Character { return characters[i] }
      }
    ```

    It would be much better to legitimize the conformance to `Collection`
and
    simply document the oddity of any concatenation corner-cases, than to
deny
    users the benefits on the grounds that a few cases are confusing.

Note that the fact that `String` is a collection of graphemes does *not*
mean
that string operations will necessarily have to do grapheme boundary
recognition. See the Unicode protocol section for details.

### `Character` and `CharacterSet`

`Character`, which represents a Unicode
[extended grapheme cluster](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries),
is a bit of a black box, requiring conversion to `String` in order to
do any introspection, including interoperation with ASCII. To fix this,
we should:

- Add a `unicodeScalars` view much like `String`'s, so that the sub-structure
  of grapheme clusters is discoverable.
- Add a failable `init` from sequences of scalars (returning nil for sequences
  that contain 0 or 2+ graphemes).
- (Lower priority) expose some operations, such as `func uppercase() -> String`,
  `var isASCII: Bool`, and, to the extent they can be sensibly generalized,
  queries of unicode properties that should also be exposed on `UnicodeScalar`,
  such as `isAlphabetic` and `isGraphemeBase`.

Despite its name, `CharacterSet` currently operates on the Swift
`UnicodeScalar`
type. This means it is usable on `String`, but only by going through the
unicode
scalar view. To deal with this clash in the short term, `CharacterSet`
should be
renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate
to
introduce a `CharacterSet` that provides similar functionality for extended
grapheme clusters.[5]

### Unification of Slicing Operations

Creating substrings is a basic part of String processing, but the slicing
operations that we have in Swift are inconsistent in both their spelling
and
their naming:

  * Slices with two explicit endpoints are done with subscript, and support
    in-place mutation:

    ```swift
        s[i..<j].mutate()
    ```

  * Slicing from an index to the end, or from the start to an index, is
done
    with a method and does not support in-place mutation:
    ```swift
        s.prefix(upTo: i).readOnly()
    ```

Prefix and suffix operations should be migrated to be subscripting operations
with one-sided ranges, i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as in
[this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
With generic subscripting in the language, that will allow us to collapse a wide
variety of methods and subscript overloads into a single implementation, and
give users an easy-to-use and composable way to describe subranges.
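Under the proposed syntax (which later shipped in Swift 4 as SE-0172), both one-sided forms collapse onto subscripting:

```swift
let s = "swift-evolution"
let i = s.index(s.startIndex, offsetBy: 5)

// One-sided ranges replace prefix(upTo:) and suffix(from:):
let head = s[..<i]
let tail = s[i...]
```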

Further extending this EDSL to integrate use-cases like
`s.prefix(maxLength: 5)`
is an ongoing research project that can be considered part of the potential
long-term vision of text (and collection) processing.

### Substrings

When implementing substring slicing, languages are faced with three
options:

1. Make the substrings the same type as string, and share storage.
2. Make the substrings the same type as string, and copy storage when
making the substring.
3. Make substrings a different type, with a storage copy on conversion to
string.

We think number 3 is the best choice. A walk-through of the tradeoffs
follows.

#### Same type, shared storage

In Swift 3.0, slicing a `String` produces a new `String` that is a view
into a
subrange of the original `String`'s storage. This is why `String` is 3
words in
size (the start, length and buffer owner), unlike the similar `Array` type
which is only one.

This is a simple model with big efficiency gains when chopping up strings
into
multiple smaller strings. But it does mean that a stored substring keeps
the
entire original string buffer alive even after it would normally have been
released.

This arrangement has proven to be problematic in other programming
languages,
because applications sometimes extract small strings from large ones and
keep
those small strings long-term. That is considered a memory leak and was
enough
of a problem in Java that they changed from substrings sharing storage to
making a copy in 1.7.

#### Same type, copied storage

Copying of substrings is also the choice made in C#, and in the default
`NSString` implementation. This approach avoids the memory leak issue, but
has
obvious performance overhead in performing the copies.

This in turn encourages trafficking in string/range pairs instead of in
substrings, for performance reasons, leading to API challenges. For
example:

```swift
foo.compare(bar, range: start..<end)
```

Here, it is not clear whether `range` applies to `foo` or `bar`. This
relationship is better expressed in Swift as a slicing operation:

```swift
foo[start..<end].compare(bar)
```

Not only does this clarify to which string the range applies, it also
brings
this sub-range capability to any API that operates on `String` "for free".
So
these other combinations also work equally well:

```swift
// apply range on argument rather than target
foo.compare(bar[start..<end])
// apply range on both
foo[start..<end].compare(bar[start1..<end1])
// compare two strings ignoring first character
foo.dropFirst().compare(bar.dropFirst())
```

In all three cases, an explicit range argument need not appear on the
`compare`
method itself. The implementation of `compare` does not need to know
anything
about ranges. Methods need only take range arguments when that was an
integral part of their purpose (for example, setting the start and end of a
user's current selection in a text box).

#### Different type, shared storage

The desire to share underlying storage while preventing accidental memory
leaks
occurs with slices of `Array`. For this reason we have an `ArraySlice`
type.
The inconvenience of a separate type is mitigated by most operations used
on
`Array` from the standard library being generic over `Sequence` or
`Collection`.

We should apply the same approach for `String` by introducing a distinct
`SubSequence` type, `Substring`. Similar advice given for `ArraySlice`
would apply to `Substring`:

> Important: Long-term storage of `Substring` instances is discouraged. A
> substring holds a reference to the entire storage of a larger string, not
> just to the portion it presents, even after the original string's lifetime
> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
> of large strings that are no longer otherwise accessible, which can appear
> to be memory leakage.

When assigning a `Substring` to a longer-lived variable (usually a stored
property) explicitly of type `String`, a type conversion will be
performed, and
at this point the substring buffer is copied and the original string's
storage
can be released.

A `String` that was not its own `Substring` could be one word—a single tagged
pointer—without requiring additional allocations. `Substring`s would be a view
onto a `String`, so are 3 words: a pointer to the owner, a pointer to the
start, and a length. The small string optimization for `Substring` would take
advantage of the larger size, probably with a less compressed encoding for
speed.

The downside of having two types is the inconvenience of sometimes having a
`Substring` when you need a `String`, and vice-versa. It is likely this
would
be a significantly bigger problem than with `Array` and `ArraySlice`, as
slicing of `String` is such a common operation. It is especially relevant
to
existing code that assumes `String` is the currency type. To ease the pain
of
type mismatches, `Substring` should be a subtype of `String` in the same
way
that `Int` is a subtype of `Optional<Int>`. This would give users an
implicit
conversion from `Substring` to `String`, as well as the usual implicit
conversions such as `[Substring]` to `[String]` that other subtype
relationships receive.
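To illustrate the intended ergonomics (hypothetical: this subtype relationship does not exist in today's Swift, where an explicit `String(…)` conversion is required at each of these points):

```swift
func greet(_ name: String) { print("Hello, \(name)!") }

let fullName = "Ada Lovelace"
let firstName = fullName.prefix(3)   // Substring sharing fullName's storage
greet(firstName)                     // implicit Substring → String, like Int → Int?
let names: [String] = [firstName]    // element-wise conversion, like [Int] → [Int?]
```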

In most cases, type inference combined with the subtype relationship should
make the type difference a non-issue and users will not care which type
they
are using. For flexibility and optimizability, most operations from the
standard library will traffic in generic models of
[`Unicode`](#the--code-unicode--code--protocol).

##### Guidance for API Designers

In this model, **if a user is unsure about which type to use, `String` is
always
a reasonable default**. A `Substring` passed where `String` is expected
will be
implicitly copied. When compared to the “same type, copied storage” model,
we
have effectively deferred the cost of copying from the point where a
substring
is created until it must be converted to `String` for use with an API.

A user who needs to optimize away copies altogether should use this
guideline:
if for performance reasons you are tempted to add a `Range` argument to
your
method as well as a `String` to avoid unnecessary copies, you should
instead
use `Substring`.
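A sketch of this guideline, where `parseField`, `Field`, and the index variables are all hypothetical:

```swift
// Instead of adding a range parameter to avoid copies...
func parseField(_ s: String, in range: Range<String.Index>) -> Field { ... }

// ...take a Substring and let callers slice at the call site:
func parseField(_ s: Substring) -> Field { ... }

let field = parseField(line[start..<end])  // storage is shared; no copy made
```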

##### The “Empty Subscript”

To make it easy to call such an optimized API when you only have a
`String` (or
to call any API that takes a `Collection`'s `SubSequence` when all you
have is
the `Collection`), we propose the following “empty subscript” operation,

extension Collection {
  subscript() -> SubSequence {
    return self[startIndex..<endIndex]
  }
}

which allows the following usage:

funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring

The `[]` syntax can be offered as a fixit when needed, similar to `&` for
an
`inout` argument. While it doesn't help a user to convert `[String]` to
`[Substring]`, the need for such conversions is extremely rare, and can be done
with a simple `map` (which could also be offered by a fixit):

takesAnArrayOfSubstring(arrayOfString.map { $0[] })

#### Other Options Considered

As we have seen, all three options above have downsides, but it's possible
these downsides could be eliminated/mitigated by the compiler. We are
proposing
one such mitigation—implicit conversion—as part of the "different type,
shared storage" option, to help avoid the cognitive load on developers of
having to deal with a separate `Substring` type.

To avoid the memory leak issues of a "same type, shared storage" substring
option, we considered whether the compiler could perform an implicit copy
of
the underlying storage when it detects the string is being "stored" for
long
term usage, say when it is assigned to a stored property. The trouble with
this
approach is it is very difficult for the compiler to distinguish between
long-term storage versus short-term in the case of abstractions that rely
on
stored properties. For example, should the storing of a substring inside an
`Optional` be considered long-term? Or the storing of multiple substrings
inside an array? The latter would not work well in the case of a
`components(separatedBy:)` implementation that intended to return an array
of
substrings. It would also be difficult to distinguish intentional
medium-term
storage of substrings, say by a lexer. There does not appear to be an
effective
consistent rule that could be applied in the general case for detecting
when a
substring is truly being stored long-term.

To avoid the cost of copying substrings under "same type, copied storage",
the
optimizer could be enhanced to reduce the impact of some of those copies.
For example, this code could be optimized to pull the invariant substring
out
of the loop:

for _ in 0..<lots {
  someFunc(takingString: bigString[bigRange])
}

It's worth noting that a similar optimization is needed to avoid an
equivalent
problem with implicit conversion in the "different type, shared storage"
case:

let substring = bigString[bigRange]
for _ in 0..<lots { someFunc(takingString: substring) }

However, in the case of "same type, copied storage" there are many use
cases
that cannot be optimized as easily. Consider the following simple
definition of
a recursive `contains` algorithm, which when substring slicing is linear
makes
the overall algorithm quadratic:

extension String {
    func containsChar(_ x: Character) -> Bool {
        return !isEmpty && (first == x || dropFirst().containsChar(x))
    }
}

Expecting the optimizer to eliminate this problem is unrealistic, forcing the
user to remember not to use string slicing if they want the code to be
efficient:

extension String {
    // add optional argument tracking progress through the string
    func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
        let idx = idx ?? startIndex
        return idx != endIndex
            && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
    }
}

#### Substrings, Ranges and Objective-C Interop

The pattern of passing a string/range pair is common in several Objective-C
APIs, and is made especially awkward in Swift by the
non-interchangeability of
`Range<String.Index>` and `NSRange`.

s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))

In general, however, the Swift idiom for operating on a sub-range of a
`Collection` is to *slice* the collection and operate on that:

s2.find(s2[j..<s2.endIndex])

Therefore, APIs that operate on an `NSString`/`NSRange` pair should be
imported
without the `NSRange` argument. The Objective-C importer should be
changed to
give these APIs special treatment so that when a `Substring` is passed,
instead
of being converted to a `String`, the full `NSString` and range are passed
to
the Objective-C method, thereby avoiding a copy.

As a result, you would never need to pass an `NSRange` to these APIs, which
solves the impedance problem by eliminating the argument, resulting in more
idiomatic Swift code while retaining the performance benefit. To help
users
manually handle any cases that remain, Foundation should be augmented to
allow
the following syntax for converting to and from `NSRange`:

let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
let iToJ = Range(nsr, in: s)    // Equivalent to i..<j

### The `Unicode` protocol

With `Substring` and `String` being distinct types and sharing almost all
interface and semantics, and with the highest-performance string processing
requiring knowledge of encoding and layout that the currency types can't
provide, it becomes important to capture the common “string API” in a
protocol.
Since Unicode conformance is a key feature of string processing in Swift,
we
call that protocol `Unicode`:

**Note:** The following assumes several features that are planned but not
yet implemented in
  Swift, and should be considered a sketch rather than a final design.

protocol Unicode
  : Comparable, BidirectionalCollection where Element == Character {

  associatedtype Encoding : UnicodeEncoding
  var encoding: Encoding { get }

  associatedtype CodeUnits
    : RandomAccessCollection where Element == Encoding.CodeUnit
  var codeUnits: CodeUnits { get }

  associatedtype UnicodeScalars
    : BidirectionalCollection  where Element == UnicodeScalar
  var unicodeScalars: UnicodeScalars { get }

  associatedtype ExtendedASCII
    : BidirectionalCollection where Element == UInt32
  var extendedASCII: ExtendedASCII { get }

}

extension Unicode {
  // ... define high-level non-mutating string operations, e.g. search ...

  func compared<Other: Unicode>(
    to rhs: Other,
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> SortOrder { ... }
}

extension Unicode : RangeReplaceableCollection where CodeUnits :
  RangeReplaceableCollection {
    // Satisfy protocol requirement
    mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
      where C.Element == Element

  // ... define high-level mutating string operations, e.g. replace ...
}

The goal is that `Unicode` exposes the underlying encoding and code units
in
such a way that for types with a known representation (e.g. a
high-performance
`UTF8String`) that information can be known at compile-time and can be
used to
generate a single path, while still allowing types like `String` that admit
multiple representations to use runtime queries and branches to fast path
specializations.

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.

### Scanning, Matching, and Tokenization

#### Low-Level Textual Analysis

We should provide convenient APIs for processing strings by character. For
example,
it should be easy to cleanly express, “if this string starts with `"f"`,
process
the rest of the string as follows…” Swift is well-suited to expressing
this
common pattern beautifully, but we need to add the APIs. Here are two
examples
of the sort of code that might be possible given such APIs:

if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
  somethingWith(input) // process the rest of input
}

if let (number, restOfInput) = input.parsingPrefix(Int.self) {
   ...
}

The specific spelling and functionality of APIs like this are TBD. The
larger
point is to make sure matching-and-consuming jobs are well-supported.

#### Unified Pattern Matcher Protocol

Many of the current methods that do matching are overloaded to do the same
logical operations in different ways, with the following axes:

- Logical Operation: `find`, `split`, `replace`, match at start
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
  the method name, and sometimes an argument
- Whole string or subrange.

We should represent these aspects as orthogonal, composable components,
abstracting pattern matchers into a protocol like
[this one](
https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33
),
that can allow us to define logical operations once, without introducing
overloads, and massively reducing API surface area.

For example, using the strawman prefix `%` syntax to turn string literals
into
patterns, the following pairs would all invoke the same generic methods:

if let found = s.firstMatch(%"searchString") { ... }
if let found = s.firstMatch(someRegex) { ... }

for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
for m in s.allMatches(someRegex) { ... }

let items = s.split(separatedBy: ", ")
let tokens = s.split(separatedBy: CharacterSet.whitespace)

Note that, because Swift requires the indices of a slice to match the
indices of
the range from which it was sliced, operations like `firstMatch` can
return a
`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match
in
the string being searched, if needed, can easily be recovered as the
`startIndex` and `endIndex` of the `Substring`.
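For example, using the strawman spelling from above (neither `firstMatch` nor the `%` prefix exists today), the match's range falls out of the returned slice:

```swift
if let found = s.firstMatch(%"needle") {            // found: Substring
  // The slice's indices are positions in `s` itself:
  let range: Range<String.Index> = found.startIndex..<found.endIndex
  print(s.distance(from: s.startIndex, to: range.lowerBound))
}
```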

Note also that matching operations are useful for collections in general,
and
would fall out of this proposal:

// replace subsequences of contiguous NaNs with zero
forces.replace(oneOrMore([Float.nan]), [0.0])

#### Regular Expressions

Addressing regular expressions is out of scope for this proposal.
That said, it is important to note that the pattern matching protocol
mentioned
above provides a suitable foundation for regular expressions, and types
such as
`NSRegularExpression` can easily be retrofitted to conform to it. In the
future, support for regular expression literals in the compiler could
allow for
compile-time syntax checking and optimization.

### String Indices

`String` currently has four views—`characters`, `unicodeScalars`, `utf8`,
and
`utf16`—each with its own opaque index type. The APIs used to translate
indices
between views add needless complexity, and the opacity of indices makes
them
difficult to serialize.

The index translation problem has two aspects:

  1. `String` views cannot consume one another's indices without a
cumbersome
    conversion step. An index into a `String`'s `characters` must be
translated
    before it can be used as a position in its `unicodeScalars`. Although
these
    translations are rarely needed, they add conceptual and API complexity.
  2. Many APIs in the core libraries and other frameworks still expose
`String`
    positions as `Int`s and regions as `NSRange`s, which can only
reference a
    `utf16` view and interoperate poorly with `String` itself.

#### Index Interchange Among Views

String's need for flexible backing storage and reasonably-efficient
indexing
(i.e. without dynamically allocating and reference-counting the indices
themselves) means indices need an efficient underlying storage type.
Although
we do not wish to expose `String`'s indices *as* integers, `Int` offsets
into
underlying code unit storage makes a good underlying storage type, provided
`String`'s underlying storage supports random-access. We think
random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Making these `Int` code unit offsets conveniently accessible and
constructible
solves the serialization problem:

clipboard.write(s.endIndex.codeUnitOffset)
let offset = clipboard.read(Int.self)
let i = String.Index(codeUnitOffset: offset)

Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
seamless by having them share an index type (semantics of indexing a
`String`
between grapheme cluster boundaries are TBD—it can either trap or be
forgiving).
Having a common index allows easy traversal into the interior of graphemes,
something that is often needed, without making it likely that someone will
do it
by accident.

- `String.index(after:)` should advance to the next grapheme, even when
the
   index points partway through a grapheme.

- `String.index(before:)` should move to the start of the grapheme before
   the current position.
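A sketch of this rounding behavior, assuming the shared index type proposed here (today each view has its own index type, so this does not compile as written):

```swift
let s = "e\u{301}tude"               // "étude": "e" + combining acute + "tude"
let i = s.unicodeScalars.index(after: s.startIndex)
// `i` falls between "e" and U+0301, i.e. inside the first grapheme.
s.index(after: i)   // proposed: advances past the whole "é", not mid-grapheme
s.index(before: i)  // proposed: moves back to the start of "é"
```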

Seamless index interchange between `String` and its UTF-8 or UTF-16 views
is not
crucial, as the specifics of encoding should not be a concern for most use
cases, and would impose needless costs on the indices of other views. That
said, we can make translation much more straightforward by exposing simple
bidirectional converting `init`s on both index types:

let u8Position = String.UTF8.Index(someStringIndex)
let originalPosition = String.Index(u8Position)

#### Index Interchange with Cocoa

We intend to address `NSRange`s that denote substrings in Cocoa APIs as
described [later in this
document](#substrings--ranges-and-objective-c-interop).
That leaves the interchange of bare indices with Cocoa APIs trafficking in
`Int`. Hopefully such APIs will be rare, but when needed, the following
extension, which would be useful for all `Collections`, can help:

extension Collection {
  func index(offset: IndexDistance) -> Index {
    return index(startIndex, offsetBy: offset)
  }
  func offset(of i: Index) -> IndexDistance {
    return distance(from: startIndex, to: i)
  }
}

Then integers can easily be translated into offsets into a `String`'s
`utf16`
view for consumption by Cocoa:

let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
let swiftIndex = s.utf16.index(offset: cocoaIndex)

### Formatting

A full treatment of formatting is out of scope of this proposal, but
we believe it's crucial for completing the text processing picture. This
section details some of the existing issues and thinking that may guide
future
development.

#### Printf-Style Formatting

`String.format` is designed on the `printf` model: it takes a format
string with
textual placeholders for substitution, and an arbitrary list of other
arguments.
The syntax and meaning of these placeholders has a long history in
C, but for anyone who doesn't use them regularly they are cryptic and
complex,
as the `printf(3)` man page attests.

Aside from complexity, this style of API has two major problems: First, the
spelling of these placeholders must match up to the types of the
arguments, in
the right order, or the behavior is undefined. Some limited support for
compile-time checking of this correspondence could be implemented, but
only for
the cases where the format string is a literal. Second, there's no
reasonable
way to extend the formatting vocabulary to cover the needs of new types:
you are
stuck with what's in the box.

#### Foundation Formatters

The formatters supplied by Foundation are highly capable and versatile,
offering
both formatting and parsing services. When used for formatting, though,
the
design pattern demands more from users than it should:

  * Matching the type of data being formatted to a formatter type
  * Creating an instance of that type
  * Setting stateful options (`currency`, `dateStyle`) on the type. Note:
the
    need for this step prevents the instance from being used and discarded
in
    the same expression where it is created.
  * Overall, the introduction of needless verbosity into source code

These may seem like small issues, but the experience of Apple localization
experts is that the total drag of these factors on programmers is such
that they
tend to reach for `String.format` instead.

#### String Interpolation

Swift string interpolation provides a user-friendly alternative to printf's
domain-specific language (just write ordinary Swift code!) and its type
safety
problems (put the data right where it belongs!) but the following issues
prevent
it from being useful for localized formatting (among other jobs):

  * [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to
restrict
    types used in string interpolation.
  * [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation
can't
    distinguish (fragments of) the base string from the string
substitutions.

In the long run, we should improve Swift string interpolation to the point
where
it can participate in almost any formatting job. Mostly this centers on
fixing the interpolation protocols per the previous item, and supporting
localization.

To be able to use formatting effectively inside interpolations, it needs
to be
both lightweight (because it all happens in-situ) and discoverable. One
approach would be to standardize on `format` methods, e.g.:

"Column 1: \(n.format(radix:16, width:8)) *** \(message)"

"Something with leading zeroes: \(x.format(fill: zero, width:8))"

### C String Interop

Our support for interoperation with nul-terminated C strings is scattered and
incoherent, with six ways to transform a C string into a `String` and four ways
to do the inverse. These APIs should be replaced with the following:

extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  ///   bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  ///   the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}

In both of the construction APIs, any invalid encoding sequence detected
will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per Unicode specification. This covers the common case. The
replacement is done *physically* in the underlying storage and the
validity of
the result is recorded in the `String`'s `encoding` such that future
accesses
need not be slowed down by the possibility of error repair.

Construction that is aborted when encoding errors are detected can be
accomplished using APIs on the `encoding`. String types that retain their
physical encoding even in the presence of errors and are repaired
on-the-fly can
be built as different instances of the `Unicode` protocol.

### Unicode 9 Conformance

Unicode 9 (and macOS 10.11) brought us support for family emoji, which
changes
the process of properly identifying `Character` boundaries. We need to
update
`String` to account for this change.

### High-Performance String Processing

Many strings are short enough to store in 64 bits, many can be stored
using only
8 bits per unicode scalar, others are best encoded in UTF-16, and some
come to
us already in some other encoding, such as UTF-8, that would be costly to
translate. Supporting these formats while maintaining usability for
general-purpose APIs demands that a single `String` type can be backed by
many
different representations.

That said, the highest performance code always requires static knowledge
of the
data structures on which it operates, and for this code, dynamic selection
of
representation comes at too high a cost. Heavy-duty text processing
demands a
way to opt out of dynamism and directly use known encodings. Having this
ability can also make it easy to cleanly specialize code that handles
dynamic
cases for maximal efficiency on the most common representations.

To address this need, we can build models of the `Unicode` protocol that
encode
representation information into the type, such as
`NFCNormalizedUTF16String`.

### Parsing ASCII Structure

Although many machine-readable formats support the inclusion of arbitrary
Unicode text, it is also common that their fundamental structure lies
entirely
within the ASCII subset (JSON, YAML, many XML formats). These formats are
often
processed most efficiently by recognizing ASCII structural elements as
ASCII,
and capturing the arbitrary sections between them in more-general
strings. The
current String API offers no way to efficiently recognize ASCII and skip
past
everything else without the overhead of full decoding into unicode scalars.

For these purposes, strings should supply an `extendedASCII` view that is a
collection of `UInt32`, where values less than `0x80` represent the
corresponding ASCII character, and other values represent data that is
specific
to the underlying encoding of the string.
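For example, a brace-matching scan over JSON could stay entirely in ASCII terms. A sketch (the `extendedASCII` view is proposed here, not yet implemented, and `json` is a hypothetical string value):

```swift
var depth = 0
for u in json.extendedASCII {
  switch u {
  case UInt32(UInt8(ascii: "{")): depth += 1
  case UInt32(UInt8(ascii: "}")): depth -= 1
  default: break   // values >= 0x80 are encoding-specific and skipped cheaply
  }
}
precondition(depth == 0, "unbalanced braces")
```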

## Language Support

This proposal depends on two new features in the Swift language:

1. **Generic subscripts**, to
   enable unified slicing syntax.

2. **A subtype relationship** between
   `Substring` and `String`, enabling framework APIs to traffic solely in
   `String` while still making it possible to avoid copies by handling
   `Substring`s where necessary.

Additionally, **the ability to nest types and protocols inside
protocols** could significantly shrink the footprint of this proposal
on the top-level Swift namespace.

## Open Questions

### Must `String` be limited to storing UTF-16 subset encodings?

- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is
not in
  question here; this is about what encodings must be storable, without
  transcoding, in the common currency type called “`String`”.
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
- If we have a way to get at a `String`'s code units, we need a concrete
type in
  which to express them in the API of `String`, which is a concrete type
- If String needs to be able to represent UTF-32, presumably the code
units need
  to be `UInt32`.
- Not supporting UTF-32-encoded text seems like one reasonable design
choice.
- Maybe we can allow UTF-8 storage in `String` and expose its code units as
  `UInt16`, just as we would for Latin-1.
- Supporting only UTF-16-subset encodings would imply that `String`
indices can
  be serialized without recording the `String`'s underlying encoding.

### Do we need a type-erasable base protocol for UnicodeEncoding?

UnicodeEncoding has an associated type, but it may be important to be able
to
traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”

### Should there be a string “facade?”

One possible design alternative makes `Unicode` a vehicle for expressing
the storage and encoding of code units, but does not attempt to give it an
API
appropriate for `String`. Instead, string APIs would be provided by a
generic
wrapper around an instance of `Unicode`:

struct StringFacade<U: Unicode> : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias String = StringFacade<StringStorage>
typealias Substring = StringFacade<StringStorage.SubSequence>

This design would allow us to de-emphasize lower-level `String` APIs such
as
access to the specific encoding, by putting them behind a `.unicode`
property.
A similar effect in a facade-less design would require a new top-level
`StringProtocol` playing the role of the facade with an `associatedtype
Storage : Unicode`.

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

struct String<U: Unicode = StringStorage>
  : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>

One advantage of such a design is that naïve users will always extend “the
right
type” (`String`) without thinking, and the new APIs will show up on
`Substring`,
`MyUTF8String`, etc. That said, it also has downsides that should not be
overlooked, not least of which is the confusability of the meaning of the
word
“string.” Is it referring to the generic or the concrete type?

### `TextOutputStream` and `TextOutputStreamable`

`TextOutputStreamable` is intended to provide a vehicle for
efficiently transporting formatted representations to an output stream
without forcing the allocation of storage. Its use of `String`, a
type with multiple representations, at the lowest-level unit of
communication, conflicts with this goal. It might be sufficient to
change `TextOutputStream` and `TextOutputStreamable` to traffic in an
associated type conforming to `Unicode`, but that is not yet clear.
This area will require some design work.

### `description` and `debugDescription`

* Should these be creating localized or non-localized representations?
* Is returning a `String` efficient enough?
* Is `debugDescription` pulling the weight of the API surface area it adds?

### `StaticString`

`StaticString` was added as a byproduct of standard library development and
kept
around because it seemed useful, but it was never truly *designed* for
client
programmers. We need to decide what happens with it. Presumably
*something*
should fill its role, and that should conform to `Unicode`.

## Footnotes

<b id="f0">0</b> The integers rewrite currently underway is expected to
    substantially reduce the scope of `Int`'s API by using more
    generics. [:leftwards_arrow_with_hook:](#a0)

<b id="f1">1</b> In practice, these semantics will usually be tied to the
version of the installed [ICU](http://icu-project.org) library, which
programmatically encodes the most complex rules of the Unicode Standard
and its
de-facto extension, CLDR.[:leftwards_arrow_with_hook:](#a1)

<b id="f2">2</b>
See [http://unicode.org/reports/tr29/#Notation](http://unicode.org/reports/tr29/#Notation).
Note
that inserting Unicode scalar values to prevent merging of grapheme
clusters would
also constitute a kind of misbehavior (one of the clusters at the boundary
would
not be found in the result), so would be relatively costly to implement,
with
little benefit. [:leftwards_arrow_with_hook:](#a2)

<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned
by
  the Unicode standard for this purpose. In fact there's
  a [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
  dedicated to it. In particular, §5.17 says:

  > When comparing text that is visible to end users, a correct linguistic
sort
  > should be used, as described in _Section 5.16, Sorting and
  > Searching_. However, in many circumstances the only requirement is for
a
  > fast, well-defined ordering. In such cases, a binary ordering can be
used.

  [:leftwards_arrow_with_hook:](#a4)

<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly
onto
properties in a table that's indexed by unicode scalar value. This table
is
part of the Unicode standard. Some of these queries (e.g., “is this an
uppercase character?”) may have fairly obvious generalizations to grapheme
clusters, but exactly how to do it is a research topic and *ideally* we'd
either
establish the existing practice that the Unicode committee would
standardize, or
the Unicode committee would do the research and we'd implement their
result.[:leftwards_arrow_with_hook:](#a5)

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Jon Hull) #8

Thanks for all the hard work!

Still digesting, but I definitely support the goal of string processing even better than Perl. Some random thoughts:

• I also like the suggestion of implicit conversion from substring slices to strings based on a subtype relationship, since I keep running into that issue when trying to use array slices. It would be nice to be able to specify that conversion behavior with other types that have a similar subtype relationship.

• One thing that stood out was the interpolation format syntax, which seemed a bit convoluted and difficult to parse:

"Something with leading zeroes: \(x.format(fill: zero, width:8))"

Have you considered treating the interpolation parenthesis more like the function call syntax? It should be a familiar pattern and easily parseable to someone versed in other areas of swift:

  "Something with leading zeroes: \(x, fill: .zero, width: 8)"

I think that should work for the common cases (e.g. padding, truncating, and alignment), with string-returning methods on the type (or even formatting objects ala NSNumberFormatter) being used for more exotic formatting needs (e.g. outputting a number as Hex instead of Decimal)

• Have you considered having an explicit .machine locale which means that the function should treat the string as machine readable? (as opposed to the lack of a locale)

• I almost feel like the machine readableness vs human readableness of a string is information that should travel with the string itself. It would be nice to have an extremely terse way to specify that a string is localizable (strawman syntax below), and that might also classify the string as human readable.

  let myLocalizedStr = $"This is localizable" // This gets used as the comment in the localization file

• Looking forward to RegEx literals!

Thanks,
Jon

···

On Jan 19, 2017, at 6:56 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

Hi all,

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

We’re eager to hear everyone’s thoughts.

Regards,
Ben and Dave

# String Processing For Swift 4

* Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](https://github.com/airspeedswift)

The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus
far, with just this short blurb in the
[list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):

**String re-evaluation**: String is one of the most important fundamental
types in the language. The standard library leads have numerous ideas of how
to improve the programming model for it, without jeopardizing the goals of
providing a unicode-correct-by-default model. Our goal is to be better at
string processing than Perl!

For Swift 4 and beyond we want to improve three dimensions of text processing:

1. Ergonomics
2. Correctness
3. Performance

This document is meant to both provide a sense of the long-term vision
(including undecided issues and possible approaches), and to define the scope of
work that could be done in the Swift 4 timeframe.

## General Principles

### Ergonomics

It's worth noting that ergonomics and correctness are mutually-reinforcing. An
API that is easy to use—but incorrectly—cannot be considered an ergonomic
success. Conversely, an API that's simply hard to use is also hard to use
correctly. Achieving optimal performance without compromising ergonomics or
correctness is a greater challenge.

Consistency with the Swift language and idioms is also important for
ergonomics. There are several places both in the standard library and in the
foundation additions to `String` where patterns and practices found elsewhere
could be applied to improve usability and familiarity.

### API Surface Area

Primary data types such as `String` should have APIs that are easily understood
given a signature and a one-line summary. Today, `String` fails that test. As
you can see, the Standard Library and Foundation both contribute significantly to
its overall complexity.

**Method Arity** | **Standard Library** | **Foundation**
---|:---:|:---:
0: `ƒ()` | 5 | 7
1: `ƒ(:)` | 19 | 48
2: `ƒ(::)` | 13 | 19
3: `ƒ(:::)` | 5 | 11
4: `ƒ(::::)` | 1 | 7
5: `ƒ(:::::)` | - | 2
6: `ƒ(::::::)` | - | 1

**API Kind** | **Standard Library** | **Foundation**
---|:---:|:---:
`init` | 41 | 18
`func` | 42 | 55
`subscript` | 9 | 0
`var` | 26 | 14

**Total: 205 APIs**

By contrast, `Int` has 80 APIs, none with more than two parameters.[0] String
processing is complex enough already; users shouldn't have to wade through
this API sprawl just to get started.

Many of the choices detailed below contribute to solving this problem,
including:

* Restoring `Collection` conformance and dropping the `.characters` view.
* Providing a more general, composable slicing syntax.
* Altering `Comparable` so that parameterized
   (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
* Clearly separating language-dependent operations on text produced
   by and for humans from language-independent
   operations on text produced by and for machine processing.
* Relocating APIs that fall outside the domain of basic string processing and
   discouraging the proliferation of ad-hoc extensions.

### Batteries Included

While `String` is available to all programs out-of-the-box, crucial APIs for
basic string processing tasks are still inaccessible until `Foundation` is
imported. While it makes sense that `Foundation` is needed for domain-specific
jobs such as
[linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
one should not need to import anything to, for example, do case-insensitive
comparison.

### Unicode Compliance and Platform Support

The Unicode standard provides a crucial objective reference point for what
constitutes correct behavior in an extremely complex domain, so
Unicode-correctness is, and will remain, a fundamental design principle behind
Swift's `String`. That said, the Unicode standard is an evolving document, so
this objective reference point is not fixed.[1] While
many of the most important operations—e.g. string hashing, equality, and
non-localized comparison—will be stable, the semantics
of others, such as grapheme breaking and localized comparison and case
conversion, are expected to change as platforms are updated, so programs should
be written so their correctness does not depend on precise stability of these
semantics across OS versions or platforms. Although it may be possible to
imagine static and/or dynamic analysis tools that will help users find such
errors, the only sure way to deal with this fact of life is to educate users.
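The stability caveat above can be seen directly in current Swift. In this sketch, the scalar counts are fixed by the text itself, while the grapheme count of a newer emoji ZWJ sequence is exactly the kind of result that may shift as a platform's Unicode tables are updated:

```swift
let eAcute = "e\u{0301}"                 // 'e' + COMBINING ACUTE ACCENT
assert(eAcute.count == 1)                // one grapheme cluster...
assert(eAcute.unicodeScalars.count == 2) // ...built from two scalars

// Newer ZWJ emoji sequences may segment differently on older platforms,
// so portable code should not hard-code grapheme counts for them:
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}" // 👨‍👩‍👧
assert(family.unicodeScalars.count == 5) // scalar count is stable
print(family.count) // grapheme count depends on the platform's Unicode rules
```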

## Design Points

### Internationalization

There is strong evidence that developers cannot determine how to use
internationalization APIs correctly. Although documentation could and should be
improved, the sheer size, complexity, and diversity of these APIs is a major
contributor to the problem, causing novices to tune out, and more experienced
programmers to make avoidable mistakes.

The first step in improving this situation is to regularize all localized
operations as invocations of normal string operations with extra
parameters. Among other things, this means:

1. Doing away with `localizedXXX` methods
2. Providing a terse way to name the current locale as a parameter
3. Automatically adjusting defaults for options such
  as case sensitivity based on whether the operation is localized.
4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
   guidance in the
   [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).

Along with appropriate documentation updates, these changes will make localized
operations more teachable, comprehensible, and approachable, thereby lowering a
barrier that currently leads some developers to ignore localization issues
altogether.

#### The Default Behavior of `String`

Although this isn't well-known, the most accessible form of many operations on
Swift `String` (and `NSString`) are really only appropriate for text that is
intended to be processed for, and consumed by, machines. The semantics of the
operations with the simplest spellings are always non-localized and
language-agnostic.

Two major factors play into this design choice:

1. Machine processing of text is important, so we should have first-class,
  accessible functions appropriate to that use case.

2. The most general localized operations require a locale parameter not required
  by their un-localized counterparts. This naturally skews complexity towards
  localized operations.

Reaffirming that `String`'s simplest APIs have
language-independent/machine-processed semantics has the benefit of clarifying
the proper default behavior of operations such as comparison, and allows us to
make [significant optimizations](#collation-semantics) that were previously
thought to conflict with Unicode.
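A small illustration of this default using today's Swift and Foundation APIs: the unadorned `sorted()` uses the machine ordering, while a human-facing sort must name a locale explicitly.

```swift
import Foundation

// Unadorned sort: machine ordering. All uppercase Latin letters precede
// all lowercase ones, because ordering follows Unicode scalar values.
let machineSorted = ["apple", "Banana", "cherry"].sorted()
assert(machineSorted == ["Banana", "apple", "cherry"])

// Human-facing sort: the locale must be named explicitly (via Foundation).
let humanSorted = ["apple", "Banana", "cherry"].sorted {
    $0.compare($1, locale: Locale(identifier: "en_US")) == .orderedAscending
}
// Under ICU's en_US collation this yields ["apple", "Banana", "cherry"],
// since case differences only matter at a lower comparison level.
```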

#### Future Directions

One of the most common internationalization errors is the unintentional
presentation to users of text that has not been localized, but regularizing APIs
and improving documentation can go only so far in preventing this error.
Combined with the fact that `String` operations are non-localized by default,
the environment for processing human-readable text may still be somewhat
error-prone in Swift 4.

For an audience of mostly non-experts, it is especially important that naïve
code is very likely to be correct if it compiles, and that more sophisticated
issues can be revealed progressively. For this reason, we intend to
specifically and separately target localization and internationalization
problems in the Swift 5 timeframe.

### Operations With Options

There are three categories of string operations that commonly need to be tuned
along various dimensions:

**Operation**|**Applicable Options**
---|---
sort ordering | locale, case/diacritic/width-insensitivity
case conversion | locale
pattern matching | locale, case/diacritic/width-insensitivity

The defaults for case-, diacritic-, and width-insensitivity are different for
localized operations than for non-localized operations, so for example a
localized sort should be case-insensitive by default, and a non-localized sort
should be case-sensitive by default. We propose a standard “language” of
defaulted parameters to be used for these purposes, with usage roughly like this:

 x.compared(to: y, case: .sensitive, in: swissGerman)

 x.lowercased(in: .currentLocale)

 x.allMatches(
   somePattern, case: .insensitive, diacritic: .insensitive)

This usage might be supported by code like this:

enum StringSensitivity {
case sensitive
case insensitive
}

extension Locale {
 static var currentLocale: Locale { ... }
}

extension Unicode {
 // An example of the option language in declaration context,
 // with nil defaults indicating unspecified, so defaults can be
 // driven by the presence/absence of a specific Locale
 func frobnicated(
   case caseSensitivity: StringSensitivity? = nil,
   diacritic diacriticSensitivity: StringSensitivity? = nil,
   width widthSensitivity: StringSensitivity? = nil,
   in locale: Locale? = nil
 ) -> Self { ... }
}

### Comparing and Hashing Strings

#### Collation Semantics

What Unicode says about collation—which is used in `<`, `==`, and hashing—turns
out to be quite interesting, once you pick it apart. The full Unicode Collation
Algorithm (UCA) works like this:

1. Fully normalize both strings
2. Convert each string to a sequence of numeric triples to form a collation key
3. “Flatten” the key by concatenating the sequence of first elements to the
  sequence of second elements to the sequence of third elements
4. Lexicographically compare the flattened keys

While step 1 can usually
be [done quickly](http://unicode.org/reports/tr15/#Description_Norm) and
incrementally, step 2 uses a collation table that maps matching *sequences* of
unicode scalars in the normalized string to *sequences* of triples, which get
accumulated into a collation key. Predictably, this is where the real costs
lie.

*However*, there are some bright spots to this story. First, as it turns out,
string sorting (localized or not) should be done down to what's called
the
[“identical” level](http://unicode.org/reports/tr10/#Multi_Level_Comparison),
which adds a step 3a: append the string's normalized form to the flattened
collation key. At first blush this just adds work, but consider what it does
for equality: two strings that normalize the same, naturally, will collate the
same. But also, *strings that normalize differently will always collate
differently*. In other words, for equality, it is sufficient to compare the
strings' normalized forms and see if they are the same. We can therefore
entirely skip the expensive part of collation for equality comparison.
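This equality-via-normalization behavior is observable in Swift today: two different scalar representations of the same text compare equal, and equal values necessarily hash alike.

```swift
let precomposed = "caf\u{00E9}"   // "café" with é as a single scalar
let decomposed = "cafe\u{0301}"   // 'e' + COMBINING ACUTE ACCENT
assert(precomposed == decomposed) // equal under canonical equivalence
// Equal values must hash equally, so hashing can also skip collation keys:
assert(precomposed.hashValue == decomposed.hashValue)
```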

Next, naturally, anything that applies to equality also applies to hashing: it
is sufficient to hash the string's normalized form, bypassing collation keys.
This should provide significant speedups over the current implementation.
Perhaps more importantly, since comparison down to the “identical” level applies
even to localized strings, it means that hashing and equality can be implemented
exactly the same way for localized and non-localized text, and hash tables with
localized keys will remain valid across current-locale changes.

Finally, once it is agreed that the *default* role for `String` is to handle
machine-generated and machine-readable text, the default ordering of `String`s
need no longer use the UCA at all. It is sufficient to order them in any way
that's consistent with equality, so `String` ordering can simply be a
lexicographical comparison of normalized forms[4]
(which is equivalent to lexicographically comparing the sequences of grapheme
clusters), again bypassing step 2 and offering another speedup.

This leaves us executing the full UCA *only* for localized sorting, and ICU's
implementation has apparently been very well optimized.

Following this scheme everywhere would also allow us to make sorting behavior
consistent across platforms. Currently, we sort `String` according to the UCA,
except that—*only on Apple platforms*—pairs of ASCII characters are ordered by
unicode scalar value.

#### Syntax

Because the current `Comparable` protocol expresses all comparisons with binary
operators, string comparisons—which may require
additional [options](#operations-with-options)—do not fit smoothly into the
existing syntax. At the same time, we'd like to solve other problems with
comparison, as outlined
in
[this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
(implemented by changes at the head
of
[this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
We should adopt a modification of that proposal that uses a method rather than
an operator `<=>`:

enum SortOrder { case before, same, after }

protocol Comparable : Equatable {
func compared(to: Self) -> SortOrder
...
}

This change will give us a syntactic platform on which to implement methods with
additional, defaulted arguments, thereby unifying and regularizing comparison
across the library.

extension String {
 func compared(to: Self) -> SortOrder
}
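A minimal sketch of how such a three-way method could be layered on today's `Comparable`; the `SortOrder` enum and `compared(to:)` name are the strawman spellings from this document, not shipping API.

```swift
// Strawman names from this document, implemented for illustration.
enum SortOrder { case before, same, after }

extension Comparable {
    /// Three-way comparison defined in terms of the existing `<`.
    func compared(to other: Self) -> SortOrder {
        if self < other { return .before }
        if other < self { return .after }
        return .same
    }
}

assert("apple".compared(to: "banana") == .before)
assert("swift".compared(to: "swift") == .same)
assert((3).compared(to: 1) == .after)
```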

**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible
that the standard library simply adopts Foundation's `ComparisonResult` as is,
but we believe the community should at least consider alternate naming before
that happens. There will be an opportunity to discuss the choices in detail
when the modified
[Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes
up for review.

### `String` should be a `Collection` of `Character`s Again

In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
convinced ourselves that its semantics differed from those of `Collection` too
significantly.

It was always well understood that if strings were treated as sequences of
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`,
and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was
a collection of `Character` (extended grapheme clusters). During 2.0
development, though, we realized that correct string concatenation could
occasionally merge distinct grapheme clusters at the start and end of combined
strings.

This quirk aside, every aspect of strings-as-collections-of-graphemes appears to
comport perfectly with Unicode. We think the concatenation problem is tolerable,
because the cases where it occurs all represent partially-formed constructs. The
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE
ACCENT)—are explicitly called out in the Unicode standard as
“[degenerate](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)” or
“[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The other
cases—such as a string ending in a zero-width joiner or half of a regional
indicator—appear to be equally transient and unlikely outside of a text editor.

Admitting these cases encourages exploration of grapheme composition and is
consistent with what appears to be an overall Unicode philosophy that “no
special provisions are made to get marginally better behavior for… cases that
never occur in practice.”[2] Furthermore, it seems
unlikely to disturb the semantics of any plausible algorithms. We can handle
these cases by documenting them, explicitly stating that the elements of a
`String` are an emergent property based on Unicode rules.
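The concatenation quirk is easy to demonstrate with current Swift: appending an isolated combining mark merges it into the preceding grapheme cluster.

```swift
var s = "e"
let accent = "\u{0301}"   // an isolated combining mark: a "degenerate" cluster
assert(s.count == 1 && accent.count == 1)
s += accent               // concatenation merges the two clusters
assert(s.count == 1)      // 1 Character + 1 Character == 1 Character: "é"
assert(s == "\u{00E9}")   // canonically equal to precomposed é
```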

The benefits of restoring `Collection` conformance are substantial:

* Collection-like operations encourage experimentation with strings to
   investigate and understand their behavior. This is useful for teaching new
   programmers, but also good for experienced programmers who want to
   understand more about strings/unicode.

* Extended grapheme clusters form a natural element boundary for Unicode
   strings. For example, searching and matching operations will always produce
   results that line up on grapheme cluster boundaries.

* Character-by-character processing is a legitimate thing to do in many real
   use-cases, including parsing, pattern matching, and language-specific
   transformations such as transliteration.

* `Collection` conformance makes a wide variety of powerful operations
   available that are appropriate to `String`'s default role as the vehicle for
   machine processed text.

   The methods `String` would inherit from `Collection` that parallel
   higher-level string algorithms have the right semantics. For example,
   grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of
   `flatMap` with case-conversion, produce the same results one would expect
   from whole-string ordering comparison, equality comparison, and
   case-conversion, respectively. `reverse` operates correctly on graphemes,
   keeping diacritics moored to their base characters and leaving emoji intact.
   Other methods such as `indexOf` and `contains` make obvious sense. A few
   `Collection` methods, like `min` and `max`, may not be particularly useful
   on `String`, but we don't consider that to be a problem worth solving, in
   the same way that we wouldn't try to suppress `min` and `max` on a
   `Set([UInt8])` that was used to store IP addresses.

* Many of the higher-level operations that we want to provide for `String`s,
   such as parsing and pattern matching, should apply to any `Collection`, and
   many of the benefits we want for `Collections`, such
   as unified slicing, should accrue
   equally to `String`. Making `String` part of the same protocol hierarchy
   allows us to write these operations once and not worry about keeping the
   benefits in sync.

* Slicing strings into substrings is a crucial part of the vocabulary of
   string processing, and all other sliceable things are `Collection`s.
   Because of its collection-like behavior, users naturally think of `String`
   in collection terms, but run into frustrating limitations where it fails to
   conform and are left to wonder where all the differences lie. Many simply
   “correct” this limitation by declaring a trivial conformance:

 extension String : BidirectionalCollection {}

   Even if we removed indexing-by-element from `String`, users could still do
   this:

     extension String : BidirectionalCollection {
       subscript(i: Index) -> Character { return characters[i] }
     }

   It would be much better to legitimize the conformance to `Collection` and
   simply document the oddity of any concatenation corner-cases, than to deny
   users the benefits on the grounds that a few cases are confusing.

Note that the fact that `String` is a collection of graphemes does *not* mean
that string operations will necessarily have to do grapheme boundary
recognition. See the Unicode protocol section for details.
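A few of the grapheme-correct collection operations described above, as they behave in Swift today:

```swift
let s = "caf\u{00E9}"                    // "café" with precomposed é
// reversed() works Character-by-Character, keeping the accent attached:
assert(String(s.reversed()) == "\u{00E9}fac")
// elementsEqual on graphemes agrees with whole-string equality,
// even across different scalar representations:
assert("cafe\u{0301}".elementsEqual(s))
// contains also lines up on grapheme cluster boundaries:
assert(s.contains("\u{00E9}"))
```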

### `Character` and `CharacterSet`

`Character`, which represents a
Unicode
[extended grapheme cluster](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries),
is a bit of a black box, requiring conversion to `String` in order to
do any introspection, including interoperation with ASCII. To fix this, we should:

- Add a `unicodeScalars` view much like `String`'s, so that the sub-structure
  of grapheme clusters is discoverable.
- Add a failable `init` from sequences of scalars (returning nil for sequences
  that contain 0 or 2+ graphemes).
- (Lower priority) expose some operations, such as `func uppercase() ->
  String`, `var isASCII: Bool`, and, to the extent they can be sensibly
  generalized, queries of unicode properties that should also be exposed on
  `UnicodeScalar` such as `isAlphabetic` and `isGraphemeBase` .
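Much of this direction later shipped: `Character.unicodeScalars` arrived in Swift 4.2 and properties such as `isASCII` in Swift 5. A quick illustration with those APIs (requires a recent toolchain):

```swift
let acute: Character = "\u{00E9}"        // precomposed é
assert(Array(acute.unicodeScalars).count == 1) // a single scalar inside
assert(!acute.isASCII)

let a: Character = "A"
assert(a.isASCII)
assert(String(a).uppercased() == "A")    // case conversion still goes via String
```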

Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar`
type. This means it is usable on `String`, but only by going through the unicode
scalar view. To deal with this clash in the short term, `CharacterSet` should be
renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to
introduce a `CharacterSet` that provides similar functionality for extended
grapheme clusters.[5]

### Unification of Slicing Operations

Creating substrings is a basic part of String processing, but the slicing
operations that we have in Swift are inconsistent in both their spelling and
their naming:

* Slices with two explicit endpoints are done with subscript, and support
   in-place mutation:

       s[i..<j].mutate()

* Slicing from an index to the end, or from the start to an index, is done
   with a method and does not support in-place mutation:

       s.prefix(upTo: i).readOnly()

Prefix and suffix operations should be migrated to be subscripting operations
with one-sided ranges, i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as
in
[this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
With generic subscripting in the language, that will allow us to collapse a wide
variety of methods and subscript overloads into a single implementation, and
give users an easy-to-use and composable way to describe subranges.

Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 5)`
is an ongoing research project that can be considered part of the potential
long-term vision of text (and collection) processing.
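One-sided range subscripts were subsequently accepted for Swift 4, so the migration described above can be demonstrated directly:

```swift
let s = "Hello, Swift"
let comma = s.firstIndex(of: ",")!
assert(s[..<comma] == "Hello")               // replaces s.prefix(upTo: comma)
assert(s[comma...] == ", Swift")             // replaces s.suffix(from: comma)
assert(s[..<comma] == s.prefix(upTo: comma)) // the two spellings agree
```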

### Substrings

When implementing substring slicing, languages are faced with three options:

1. Make the substrings the same type as string, and share storage.
2. Make the substrings the same type as string, and copy storage when making the substring.
3. Make substrings a different type, with a storage copy on conversion to string.

We think number 3 is the best choice. A walk-through of the tradeoffs follows.

#### Same type, shared storage

In Swift 3.0, slicing a `String` produces a new `String` that is a view into a
subrange of the original `String`'s storage. This is why `String` is 3 words in
size (the start, length and buffer owner), unlike the similar `Array` type
which is only one.

This is a simple model with big efficiency gains when chopping up strings into
multiple smaller strings. But it does mean that a stored substring keeps the
entire original string buffer alive even after it would normally have been
released.

This arrangement has proven to be problematic in other programming languages,
because applications sometimes extract small strings from large ones and keep
those small strings long-term. That is considered a memory leak and was enough
of a problem in Java that they changed from substrings sharing storage to
making a copy in 1.7.

#### Same type, copied storage

Copying of substrings is also the choice made in C#, and in the default
`NSString` implementation. This approach avoids the memory leak issue, but has
obvious performance overhead in performing the copies.

This in turn encourages trafficking in string/range pairs instead of in
substrings, for performance reasons, leading to API challenges. For example:

foo.compare(bar, range: start..<end)

Here, it is not clear whether `range` applies to `foo` or `bar`. This
relationship is better expressed in Swift as a slicing operation:

foo[start..<end].compare(bar)

Not only does this clarify to which string the range applies, it also brings
this sub-range capability to any API that operates on `String` "for free". So
these other combinations also work equally well:

// apply range on argument rather than target
foo.compare(bar[start..<end])
// apply range on both
foo[start..<end].compare(bar[start1..<end1])
// compare two strings ignoring first character
foo.dropFirst().compare(bar.dropFirst())

In all three cases, an explicit range argument need not appear on the `compare`
method itself. The implementation of `compare` does not need to know anything
about ranges. Methods need only take range arguments when that was an
integral part of their purpose (for example, setting the start and end of a
user's current selection in a text box).
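A runnable version of the last combination, using today's `Substring` comparisons rather than a hypothetical `compare` method:

```swift
let foo = "banana"
let bar = "bandana"
// "Compare two strings ignoring the first character", with no range parameters:
// dropFirst() yields Substrings, which are directly Comparable and Equatable.
assert(foo.dropFirst() < bar.dropFirst())   // "anana" < "andana"
assert(foo.dropFirst() == "anana")
```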

#### Different type, shared storage

The desire to share underlying storage while preventing accidental memory leaks
occurs with slices of `Array`. For this reason we have an `ArraySlice` type.
The inconvenience of a separate type is mitigated by most operations used on
`Array` from the standard library being generic over `Sequence` or `Collection`.

We should apply the same approach for `String` by introducing a distinct
`SubSequence` type, `Substring`. Similar advice given for `ArraySlice` would apply to `Substring`:

Important: Long-term storage of `Substring` instances is discouraged. A
substring holds a reference to the entire storage of a larger string, not
just to the portion it presents, even after the original string's lifetime
ends. Long-term storage of a `Substring` may therefore prolong the lifetime
of large strings that are no longer otherwise accessible, which can appear
to be memory leakage.

When assigning a `Substring` to a longer-lived variable (usually a stored
property) explicitly of type `String`, a type conversion will be performed, and
at this point the substring buffer is copied and the original string's storage
can be released.
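In the Swift 4 implementation that followed this design, the conversion is spelled explicitly rather than happening implicitly, but the copy still occurs at the conversion point:

```swift
let big = String(repeating: "swift ", count: 1_000)
let sub = big.prefix(5)            // a Substring: a view into big's storage
assert(type(of: sub) == Substring.self)

let kept = String(sub)             // conversion copies just these 5 Characters,
assert(kept == "swift")            // after which big's buffer can be released
```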

A `String` that was not its own `Substring` could be one word—a single tagged
pointer—without requiring additional allocations. `Substring`s would be a view
onto a `String`, so are 3 words - pointer to owner, pointer to start, and a
length. The small string optimization for `Substring` would take advantage of
the larger size, probably with a less compressed encoding for speed.

The downside of having two types is the inconvenience of sometimes having a
`Substring` when you need a `String`, and vice-versa. It is likely this would
be a significantly bigger problem than with `Array` and `ArraySlice`, as
slicing of `String` is such a common operation. It is especially relevant to
existing code that assumes `String` is the currency type. To ease the pain of
type mismatches, `Substring` should be a subtype of `String` in the same way
that `Int` is a subtype of `Optional<Int>`. This would give users an implicit
conversion from `Substring` to `String`, as well as the usual implicit
conversions such as `[Substring]` to `[String]` that other subtype
relationships receive.

In most cases, type inference combined with the subtype relationship should
make the type difference a non-issue and users will not care which type they
are using. For flexibility and optimizability, most operations from the
standard library will traffic in generic models of
[`Unicode`](#the--code-unicode--code--protocol).

##### Guidance for API Designers

In this model, **if a user is unsure about which type to use, `String` is always
a reasonable default**. A `Substring` passed where `String` is expected will be
implicitly copied. When compared to the “same type, copied storage” model, we
have effectively deferred the cost of copying from the point where a substring
is created until it must be converted to `String` for use with an API.

A user who needs to optimize away copies altogether should use this guideline:
if for performance reasons you are tempted to add a `Range` argument to your
method as well as a `String` to avoid unnecessary copies, you should instead
use `Substring`.
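In shipped Swift, this guidance is usually followed by taking `Substring` directly or by writing the API generically over `StringProtocol`. A sketch, where `firstWord` is a hypothetical helper:

```swift
// The generic constraint lets callers pass either a String or a Substring
// without any copying; the result slices share the argument's storage.
func firstWord<S: StringProtocol>(of s: S) -> S.SubSequence {
    return s.split(separator: " ", maxSplits: 1,
                   omittingEmptySubsequences: true).first ?? s[...]
}

assert(firstWord(of: "hello world") == "hello")            // from a String
assert(firstWord(of: "hello world".dropFirst()) == "ello") // from a Substring
```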

##### The “Empty Subscript”

To make it easy to call such an optimized API when you only have a `String` (or
to call any API that takes a `Collection`'s `SubSequence` when all you have is
the `Collection`), we propose the following “empty subscript” operation,

extension Collection {
 subscript() -> SubSequence { 
   return self[startIndex..<endIndex] 
 }
}

which allows the following usage:

funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring

The `[]` syntax can be offered as a fixit when needed, similar to `&` for an
`inout` argument. While it doesn't help a user to convert `[String]` to
`[Substring]`, the need for such conversions is extremely rare, can be done with
a simple `map` (which could also be offered by a fixit):

takesAnArrayOfSubstring(arrayOfString.map { $0[] })

#### Other Options Considered

As we have seen, all three options above have downsides, but it's possible
these downsides could be eliminated/mitigated by the compiler. We are proposing
one such mitigation—implicit conversion—as part of the "different type,
shared storage" option, to help avoid the cognitive load on developers of
having to deal with a separate `Substring` type.

To avoid the memory leak issues of a "same type, shared storage" substring
option, we considered whether the compiler could perform an implicit copy of
the underlying storage when it detects the string is being "stored" for long
term usage, say when it is assigned to a stored property. The trouble with this
approach is it is very difficult for the compiler to distinguish between
long-term storage versus short-term in the case of abstractions that rely on
stored properties. For example, should the storing of a substring inside an
`Optional` be considered long-term? Or the storing of multiple substrings
inside an array? The latter would not work well in the case of a
`components(separatedBy:)` implementation that intended to return an array of
substrings. It would also be difficult to distinguish intentional medium-term
storage of substrings, say by a lexer. There does not appear to be an effective
consistent rule that could be applied in the general case for detecting when a
substring is truly being stored long-term.

To avoid the cost of copying substrings under "same type, copied storage", the
optimizer could be enhanced to reduce the impact of some of those copies.
For example, this code could be optimized to pull the invariant substring out
of the loop:

for _ in 0..<lots { 
 someFunc(takingString: bigString[bigRange]) 
}

It's worth noting that a similar optimization is needed to avoid an equivalent
problem with implicit conversion in the "different type, shared storage" case:

let substring = bigString[bigRange]
for _ in 0..<lots { someFunc(takingString: substring) }

However, in the case of "same type, copied storage" there are many use cases
that cannot be optimized as easily. Consider the following simple definition of
a recursive `contains` algorithm, which, when substring slicing is linear, makes
the overall algorithm quadratic:

extension String {
   func containsChar(_ x: Character) -> Bool {
       return !isEmpty && (first == x || dropFirst().containsChar(x))
   }
}

For the optimizer to eliminate this problem is unrealistic, forcing the user to
remember to optimize the code to not use string slicing if they want it to be
efficient (assuming they remember):

extension String {
   // add optional argument tracking progress through the string
   func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
       let idx = idx ?? startIndex
       return idx != endIndex
           && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
   }
}

#### Substrings, Ranges and Objective-C Interop

The pattern of passing a string/range pair is common in several Objective-C
APIs, and is made especially awkward in Swift by the non-interchangeability of
`Range<String.Index>` and `NSRange`.

s2.find(s1, sourceRange: NSRange(j..<s2.endIndex, in: s2))

In general, however, the Swift idiom for operating on a sub-range of a
`Collection` is to *slice* the collection and operate on that:

s2[j..<s2.endIndex].find(s1)

Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported
without the `NSRange` argument. The Objective-C importer should be changed to
give these APIs special treatment so that when a `Substring` is passed, instead
of being converted to a `String`, the full `NSString` and range are passed to
the Objective-C method, thereby avoiding a copy.

As a result, you would never need to pass an `NSRange` to these APIs, which
solves the impedance problem by eliminating the argument, resulting in more
idiomatic Swift code while retaining the performance benefit. To help users
manually handle any cases that remain, Foundation should be augmented to allow
the following syntax for converting to and from `NSRange`:

```swift
let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
let iToJ = Range(nsr, in: s)    // Equivalent to i..<j
```
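
Both conversions shipped in Foundation alongside Swift 4 as `NSRange.init(_:in:)`
and `Range.init(_:in:)`, and they make the `Character`/UTF-16 mismatch explicit:

```swift
import Foundation

let s = "👍 up"
let r = s.startIndex..<s.index(after: s.startIndex)  // the single Character "👍"

let nsr = NSRange(r, in: s)
print(nsr.length)            // 2 — one Character, but two UTF-16 code units

if let roundTrip = Range(nsr, in: s) {
  print(s[roundTrip])        // 👍
}
```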

### The `Unicode` protocol

With `Substring` and `String` being distinct types and sharing almost all
interface and semantics, and with the highest-performance string processing
requiring knowledge of encoding and layout that the currency types can't
provide, it becomes important to capture the common “string API” in a protocol.
Since Unicode conformance is a key feature of string processing in Swift, we
call that protocol `Unicode`:

**Note:** The following assumes several features that are planned but not yet implemented in
Swift, and should be considered a sketch rather than a final design.

```swift
protocol Unicode
  : Comparable, BidirectionalCollection where Element == Character {

  associatedtype Encoding : UnicodeEncoding
  var encoding: Encoding { get }

  associatedtype CodeUnits
    : RandomAccessCollection where Element == Encoding.CodeUnit
  var codeUnits: CodeUnits { get }

  associatedtype UnicodeScalars
    : BidirectionalCollection where Element == UnicodeScalar
  var unicodeScalars: UnicodeScalars { get }

  associatedtype ExtendedASCII
    : BidirectionalCollection where Element == UInt32
  var extendedASCII: ExtendedASCII { get }
}

extension Unicode {
  // ... define high-level non-mutating string operations, e.g. search ...

  func compared<Other: Unicode>(
    to rhs: Other,
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> SortOrder { ... }
}

extension Unicode : RangeReplaceableCollection
  where CodeUnits : RangeReplaceableCollection {

  // Satisfy protocol requirement
  mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
    where C.Element == Element

  // ... define high-level mutating string operations, e.g. replace ...
}
```

The goal is that `Unicode` exposes the underlying encoding and code units in
such a way that for types with a known representation (e.g. a high-performance
`UTF8String`) that information can be known at compile-time and can be used to
generate a single path, while still allowing types like `String` that admit
multiple representations to use runtime queries and branches to fast path
specializations.

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.

### Scanning, Matching, and Tokenization

#### Low-Level Textual Analysis

We should provide convenient APIs for processing strings by character. For example,
it should be easy to cleanly express, “if this string starts with `"f"`, process
the rest of the string as follows…” Swift is well-suited to expressing this
common pattern beautifully, but we need to add the APIs. Here are two examples
of the sort of code that might be possible given such APIs:

```swift
if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
  somethingWith(input) // process the rest of input
}

if let (number, restOfInput) = input.parsingPrefix(Int.self) {
  ...
}
```

The specific spelling and functionality of APIs like this are TBD. The larger
point is to make sure matching-and-consuming jobs are well-supported.
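
As one hedged sketch of what such an API might look like in current Swift
(`droppingPrefix` here is a hypothetical name, taking a predicate rather than
the pattern argument shown above):

```swift
extension Substring {
  // Illustrative only: consume and return the first character if it matches `p`.
  mutating func droppingPrefix(_ p: (Character) -> Bool) -> Character? {
    guard let c = first, p(c) else { return nil }
    removeFirst()
    return c
  }
}

var input: Substring = "f(x)"
if let firstLetter = input.droppingPrefix({ $0.isLetter }) {
  print(firstLetter, input)   // f (x)
}
```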

#### Unified Pattern Matcher Protocol

Many of the current methods that do matching are overloaded to do the same
logical operations in different ways, with the following axes:

- Logical Operation: `find`, `split`, `replace`, match at start
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
the method name, and sometimes an argument
- Whole string or subrange.

We should represent these aspects as orthogonal, composable components,
abstracting pattern matchers into a protocol like
[this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
that can allow us to define logical operations once, without introducing
overloads, and massively reducing API surface area.

For example, using the strawman prefix `%` syntax to turn string literals into
patterns, the following pairs would all invoke the same generic methods:

```swift
if let found = s.firstMatch(%"searchString") { ... }
if let found = s.firstMatch(someRegex) { ... }

for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
for m in s.allMatches(someRegex) { ... }

let items = s.split(separatedBy: ", ")
let tokens = s.split(separatedBy: CharacterSet.whitespace)
```

Note that, because Swift requires the indices of a slice to match the indices of
the range from which it was sliced, operations like `firstMatch` can return a
`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in
the string being searched, if needed, can easily be recovered as the
`startIndex` and `endIndex` of the `Substring`.
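
The index-preservation property can be seen today with Foundation's
`range(of:)` and ordinary slicing:

```swift
import Foundation

let s = "one, two, three"
if let r = s.range(of: "two") {
  let match = s[r]   // a Substring standing in for the Range<String.Index>
  // The slice's indices are positions in the original string:
  assert(match.startIndex == r.lowerBound)
  assert(match.endIndex == r.upperBound)
  print(match)       // two
}
```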

Note also that matching operations are useful for collections in general, and
would fall out of this proposal:

```swift
// replace subsequences of contiguous NaNs with zero
forces.replace(oneOrMore([Float.nan]), [0.0])
```
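
Until such a pattern protocol exists, that one-liner can be written out by
hand; this stand-in collapses each run of NaNs into a single zero:

```swift
// Manual equivalent of the hypothetical `forces.replace(oneOrMore([Float.nan]), [0.0])`.
func collapsingNaNRuns(_ xs: [Float]) -> [Float] {
  var result: [Float] = []
  var inRun = false
  for x in xs {
    if x.isNaN {
      if !inRun { result.append(0.0) }  // first NaN of a run becomes one zero
      inRun = true
    } else {
      result.append(x)
      inRun = false
    }
  }
  return result
}

print(collapsingNaNRuns([1, .nan, .nan, 2, .nan, 3]))  // [1.0, 0.0, 2.0, 0.0, 3.0]
```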

#### Regular Expressions

Addressing regular expressions is out of scope for this proposal.
That said, it is important to note that the pattern matching protocol mentioned
above provides a suitable foundation for regular expressions, and types such as
`NSRegularExpression` can easily be retrofitted to conform to it. In the
future, support for regular expression literals in the compiler could allow for
compile-time syntax checking and optimization.

### String Indices

`String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and
`utf16`—each with its own opaque index type. The APIs used to translate indices
between views add needless complexity, and the opacity of indices makes them
difficult to serialize.

The index translation problem has two aspects:

1. `String` views cannot consume one another's indices without a cumbersome
   conversion step. An index into a `String`'s `characters` must be translated
   before it can be used as a position in its `unicodeScalars`. Although these
   translations are rarely needed, they add conceptual and API complexity.
2. Many APIs in the core libraries and other frameworks still expose `String`
   positions as `Int`s and regions as `NSRange`s, which can only reference a
   `utf16` view and interoperate poorly with `String` itself.

#### Index Interchange Among Views

String's need for flexible backing storage and reasonably-efficient indexing
(i.e. without dynamically allocating and reference-counting the indices
themselves) means indices need an efficient underlying storage type. Although
we do not wish to expose `String`'s indices *as* integers, `Int` offsets into
underlying code unit storage makes a good underlying storage type, provided
`String`'s underlying storage supports random-access. We think random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Making these `Int` code unit offsets conveniently accessible and constructible
solves the serialization problem:

```swift
clipboard.write(s.endIndex.codeUnitOffset)
let offset = clipboard.read(Int.self)
let i = String.Index(codeUnitOffset: offset)
```
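
Swift 5 eventually shipped a close relative of this sketch (SE-0241's
`utf16Offset(in:)` and `String.Index(utf16Offset:in:)`), which already supports
the clipboard round trip:

```swift
let s = "naïve"
let offset = s.endIndex.utf16Offset(in: s)          // serialize as an Int
let restored = String.Index(utf16Offset: offset, in: s)  // deserialize
assert(restored == s.endIndex)
```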

Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
seamless by having them share an index type (semantics of indexing a `String`
between grapheme cluster boundaries are TBD—it can either trap or be forgiving).
Having a common index allows easy traversal into the interior of graphemes,
something that is often needed, without making it likely that someone will do it
by accident.

- `String.index(after:)` should advance to the next grapheme, even when the
  index points partway through a grapheme.

- `String.index(before:)` should move to the start of the grapheme before
  the current position.

Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not
crucial, as the specifics of encoding should not be a concern for most use
cases, and would impose needless costs on the indices of other views. That
said, we can make translation much more straightforward by exposing simple
bidirectional converting `init`s on both index types:

```swift
let u8Position = String.UTF8.Index(someStringIndex)
let originalPosition = String.Index(u8Position)
```
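
In today's standard library the same translation is spelled `samePosition(in:)`:

```swift
let s = "🎉 party"
let u8Position = s.utf8.firstIndex(of: UInt8(ascii: "p"))!
let originalPosition = u8Position.samePosition(in: s)!  // back to a String index
assert(s[originalPosition] == "p")
```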

#### Index Interchange with Cocoa

We intend to address `NSRange`s that denote substrings in Cocoa APIs as
described [later in this document](#substrings--ranges-and-objective-c-interop).
That leaves the interchange of bare indices with Cocoa APIs trafficking in
`Int`. Hopefully such APIs will be rare, but when needed, the following
extension, which would be useful for all `Collection`s, can help:

```swift
extension Collection {
  func index(offset: IndexDistance) -> Index {
    return index(startIndex, offsetBy: offset)
  }
  func offset(of i: Index) -> IndexDistance {
    return distance(from: startIndex, to: i)
  }
}
```

Then integers can easily be translated into offsets into a `String`'s `utf16`
view for consumption by Cocoa:

```swift
let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
let swiftIndex = s.utf16.index(offset: cocoaIndex)
```
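
With that helper (repeated here so the snippet stands alone), the round trip
works in current Swift, where a single index type serves `String` and its views:

```swift
extension Collection {
  // Offset-based helpers from the text (IndexDistance is Int in modern Swift).
  func index(offset: Int) -> Index { index(startIndex, offsetBy: offset) }
  func offset(of i: Index) -> Int { distance(from: startIndex, to: i) }
}

let s = "a👍b"
let i = s.firstIndex(of: "b")!
let cocoaIndex = s.utf16.offset(of: i)         // 3 — the emoji spans two UTF-16 units
let swiftIndex = s.utf16.index(offset: cocoaIndex)
assert(s[swiftIndex] == "b")
```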

### Formatting

A full treatment of formatting is out of scope of this proposal, but
we believe it's crucial for completing the text processing picture. This
section details some of the existing issues and thinking that may guide future
development.

#### Printf-Style Formatting

`String.format` is designed on the `printf` model: it takes a format string with
textual placeholders for substitution, and an arbitrary list of other arguments.
The syntax and meaning of these placeholders has a long history in
C, but for anyone who doesn't use them regularly they are cryptic and complex,
as the `printf (3)` man page attests.

Aside from complexity, this style of API has two major problems: First, the
spelling of these placeholders must match up to the types of the arguments, in
the right order, or the behavior is undefined. Some limited support for
compile-time checking of this correspondence could be implemented, but only for
the cases where the format string is a literal. Second, there's no reasonable
way to extend the formatting vocabulary to cover the needs of new types: you are
stuck with what's in the box.

#### Foundation Formatters

The formatters supplied by Foundation are highly capable and versatile, offering
both formatting and parsing services. When used for formatting, though, the
design pattern demands more from users than it should:

* Matching the type of data being formatted to a formatter type
* Creating an instance of that type
* Setting stateful options (`currency`, `dateStyle`) on the type. Note: the
   need for this step prevents the instance from being used and discarded in
   the same expression where it is created.
* Overall, introduction of needless verbosity into source

These may seem like small issues, but the experience of Apple localization
experts is that the total drag of these factors on programmers is such that they
tend to reach for `String.format` instead.

#### String Interpolation

Swift string interpolation provides a user-friendly alternative to printf's
domain-specific language (just write ordinary swift code!) and its type safety
problems (put the data right where it belongs!) but the following issues prevent
it from being useful for localized formatting (among other jobs):

* [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
   types used in string interpolation.
* [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
   distinguish (fragments of) the base string from the string substitutions.

In the long run, we should improve Swift string interpolation to the point where
it can participate in most any formatting job. Mostly this centers around
fixing the interpolation protocols per the previous item, and supporting
localization.

To be able to use formatting effectively inside interpolations, it needs to be
both lightweight (because it all happens in-situ) and discoverable. One
approach would be to standardize on `format` methods, e.g.:

```swift
"Column 1: \(n.format(radix:16, width:8)) *** \(message)"

"Something with leading zeroes: \(x.format(fill: zero, width:8))"
```
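
Pending such `format` methods, the hex example can be approximated with the
existing `String(_:radix:)` initializer and ad-hoc padding:

```swift
let n = 48879
let hex = String(n, radix: 16)                                    // "beef"
let padded = String(repeating: "0", count: max(0, 8 - hex.count)) + hex
print("Column 1: \(padded)")   // Column 1: 0000beef
```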

### C String Interop

Our support for interoperation with nul-terminated C strings is scattered and
incoherent, with six ways to transform a C string into a `String` and four ways
to do the inverse. These APIs should be replaced with the following:

```swift
extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  ///   bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  ///   the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}
```

In both of the construction APIs, any invalid encoding sequence detected will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per the Unicode specification. This covers the common case. The
replacement is done *physically* in the underlying storage, and the validity of
the result is recorded in the `String`'s `encoding` so that future accesses
need not be slowed down by separate error repair.

Construction that is aborted when encoding errors are detected can be
accomplished using APIs on the `encoding`. String types that retain their
physical encoding even in the presence of errors and are repaired on-the-fly can
be built as different instances of the `Unicode` protocol.
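
A subset of this surface already exists in today's standard library; the UTF-8
`init(cString:)` and `withCString` behave as described:

```swift
let bytes: [CChar] = [72, 105, 0]   // "Hi" followed by the NUL terminator
let s = String(cString: bytes)      // the existing UTF-8 constructor
assert(s == "Hi")

let length = s.withCString { p -> Int in
  var n = 0
  while p[n] != 0 { n += 1 }        // count bytes up to the NUL
  return n
}
assert(length == 2)
```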

### Unicode 9 Conformance

Unicode 9 (and MacOS 10.11) brought us support for family emoji, which changes
the process of properly identifying `Character` boundaries. We need to update
`String` to account for this change.

### High-Performance String Processing

Many strings are short enough to store in 64 bits, many can be stored using only
8 bits per unicode scalar, others are best encoded in UTF-16, and some come to
us already in some other encoding, such as UTF-8, that would be costly to
translate. Supporting these formats while maintaining usability for
general-purpose APIs demands that a single `String` type can be backed by many
different representations.

That said, the highest performance code always requires static knowledge of the
data structures on which it operates, and for this code, dynamic selection of
representation comes at too high a cost. Heavy-duty text processing demands a
way to opt out of dynamism and directly use known encodings. Having this
ability can also make it easy to cleanly specialize code that handles dynamic
cases for maximal efficiency on the most common representations.

To address this need, we can build models of the `Unicode` protocol that encode
representation information into the type, such as `NFCNormalizedUTF16String`.

### Parsing ASCII Structure

Although many machine-readable formats support the inclusion of arbitrary
Unicode text, it is also common that their fundamental structure lies entirely
within the ASCII subset (JSON, YAML, many XML formats). These formats are often
processed most efficiently by recognizing ASCII structural elements as ASCII,
and capturing the arbitrary sections between them in more-general strings. The
current String API offers no way to efficiently recognize ASCII and skip past
everything else without the overhead of full decoding into unicode scalars.

For these purposes, strings should supply an `extendedASCII` view that is a
collection of `UInt32`, where values less than `0x80` represent the
corresponding ASCII character, and other values represent data that is specific
to the underlying encoding of the string.
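
Pending an `extendedASCII` view, the `utf8` view already permits this style of
scan, since in UTF-8 every structural ASCII byte is a value below `0x80` and
never occurs inside a multi-byte sequence:

```swift
let json = "{\"greeting\": \"héllo 🌍\"}"
var depth = 0
for byte in json.utf8 {
  switch byte {
  case UInt8(ascii: "{"): depth += 1  // structural characters are pure ASCII
  case UInt8(ascii: "}"): depth -= 1
  default: break                      // skip everything else without decoding
  }
}
assert(depth == 0)
```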

## Language Support

This proposal depends on two new features in the Swift language:

1. **Generic subscripts**, to
  enable unified slicing syntax.

2. **A subtype relationship** between
  `Substring` and `String`, enabling framework APIs to traffic solely in
  `String` while still making it possible to avoid copies by handling
  `Substring`s where necessary.

Additionally, **the ability to nest types and protocols inside
protocols** could significantly shrink the footprint of this proposal
on the top-level Swift namespace.

## Open Questions

### Must `String` be limited to storing UTF-16 subset encodings?

- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
question here; this is about what encodings must be storable, without
transcoding, in the common currency type called “`String`”.
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
- If we have a way to get at a `String`'s code units, we need a concrete type in
which to express them in the API of `String`, which is a concrete type.
- If String needs to be able to represent UTF-32, presumably the code units need
to be `UInt32`.
- Not supporting UTF-32-encoded text seems like one reasonable design choice.
- Maybe we can allow UTF-8 storage in `String` and expose its code units as
`UInt16`, just as we would for Latin-1.
- Supporting only UTF-16-subset encodings would imply that `String` indices can
be serialized without recording the `String`'s underlying encoding.

### Do we need a type-erasable base protocol for `UnicodeEncoding`?

UnicodeEncoding has an associated type, but it may be important to be able to
traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”

### Should there be a string “facade?”

One possible design alternative makes `Unicode` a vehicle for expressing
the storage and encoding of code units, but does not attempt to give it an API
appropriate for `String`. Instead, string APIs would be provided by a generic
wrapper around an instance of `Unicode`:

```swift
struct StringFacade<U: Unicode> : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias String = StringFacade<StringStorage>
typealias Substring = StringFacade<StringStorage.SubSequence>
```

This design would allow us to de-emphasize lower-level `String` APIs such as
access to the specific encoding, by putting them behind a `.unicode` property.
A similar effect in a facade-less design would require a new top-level
`StringProtocol` playing the role of the facade with an `associatedtype
Storage : Unicode`.

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

```swift
struct String<U: Unicode = StringStorage>
  : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>
```

One advantage of such a design is that naïve users will always extend “the right
type” (`String`) without thinking, and the new APIs will show up on `Substring`,
`MyUTF8String`, etc. That said, it also has downsides that should not be
overlooked, not least of which is the confusability of the meaning of the word
“string.” Is it referring to the generic or the concrete type?

### `TextOutputStream` and `TextOutputStreamable`

`TextOutputStreamable` is intended to provide a vehicle for
efficiently transporting formatted representations to an output stream
without forcing the allocation of storage. Its use of `String`, a
type with multiple representations, at the lowest-level unit of
communication, conflicts with this goal. It might be sufficient to
change `TextOutputStream` and `TextOutputStreamable` to traffic in an
associated type conforming to `Unicode`, but that is not yet clear.
This area will require some design work.

### `description` and `debugDescription`

* Should these be creating localized or non-localized representations?
* Is returning a `String` efficient enough?
* Is `debugDescription` pulling the weight of the API surface area it adds?

### `StaticString`

`StaticString` was added as a byproduct of standard library development and kept
around because it seemed useful, but it was never truly *designed* for client
programmers. We need to decide what happens with it. Presumably *something*
should fill its role, and that should conform to `Unicode`.

## Footnotes

<b id="f0">0</b> The integers rewrite currently underway is expected to
   substantially reduce the scope of `Int`'s API by using more
   generics. [:leftwards_arrow_with_hook:](#a0)

<b id="f1">1</b> In practice, these semantics will usually be tied to the
version of the installed [ICU](http://icu-project.org) library, which
programmatically encodes the most complex rules of the Unicode Standard and its
de-facto extension, CLDR.[:leftwards_arrow_with_hook:](#a1)

<b id="f2">2</b>
See
[http://unicode.org/reports/tr29/#Notation](http://unicode.org/reports/tr29/#Notation). Note
that inserting Unicode scalar values to prevent merging of grapheme clusters would
also constitute a kind of misbehavior (one of the clusters at the boundary would
not be found in the result), so would be relatively costly to implement, with
little benefit. [:leftwards_arrow_with_hook:](#a2)

<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by
the Unicode standard for this purpose. In fact there's
a [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
dedicated to it. In particular, §5.17 says:

> When comparing text that is visible to end users, a correct linguistic sort
> should be used, as described in _Section 5.16, Sorting and
> Searching_. However, in many circumstances the only requirement is for a
> fast, well-defined ordering. In such cases, a binary ordering can be used.

[:leftwards_arrow_with_hook:](#a4)

<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto
properties in a table that's indexed by unicode scalar value. This table is
part of the Unicode standard. Some of these queries (e.g., “is this an
uppercase character?”) may have fairly obvious generalizations to grapheme
clusters, but exactly how to do it is a research topic and *ideally* we'd either
establish the existing practice that the Unicode committee would standardize, or
the Unicode committee would do the research and we'd implement their
result.[:leftwards_arrow_with_hook:](#a5)

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Rien) #9

Wow, I fully support the intention (becoming better than Perl) but I cannot comment on the contents without studying it for a couple of days…

Regards,
Rien

Site: http://balancingrock.nl
Blog: http://swiftrien.blogspot.com
Github: http://github.com/Swiftrien
Project: http://swiftfire.nl

···

On 20 Jan 2017, at 03:56, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

Hi all,

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

We’re eager to hear everyone’s thoughts.

Regards,
Ben and Dave

# String Processing For Swift 4

* Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](https://github.com/airspeedswift)

The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus
far, with just this short blurb in the
[list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):

**String re-evaluation**: String is one of the most important fundamental
types in the language. The standard library leads have numerous ideas of how
to improve the programming model for it, without jeopardizing the goals of
providing a unicode-correct-by-default model. Our goal is to be better at
string processing than Perl!

For Swift 4 and beyond we want to improve three dimensions of text processing:

1. Ergonomics
2. Correctness
3. Performance

This document is meant to both provide a sense of the long-term vision
(including undecided issues and possible approaches), and to define the scope of
work that could be done in the Swift 4 timeframe.

## General Principles

### Ergonomics

It's worth noting that ergonomics and correctness are mutually-reinforcing. An
API that is easy to use—but incorrectly—cannot be considered an ergonomic
success. Conversely, an API that's simply hard to use is also hard to use
correctly. Acheiving optimal performance without compromising ergonomics or
correctness is a greater challenge.

Consistency with the Swift language and idioms is also important for
ergonomics. There are several places both in the standard library and in the
foundation additions to `String` where patterns and practices found elsewhere
could be applied to improve usability and familiarity.

### API Surface Area

Primary data types such as `String` should have APIs that are easily understood
given a signature and a one-line summary. Today, `String` fails that test. As
the tables below show, the Standard Library and Foundation both contribute
significantly to its overall complexity.

**Method Arity** | **Standard Library** | **Foundation**
---|:---:|:---:
0: `ƒ()` | 5 | 7
1: `ƒ(:)` | 19 | 48
2: `ƒ(::)` | 13 | 19
3: `ƒ(:::)` | 5 | 11
4: `ƒ(::::)` | 1 | 7
5: `ƒ(:::::)` | - | 2
6: `ƒ(::::::)` | - | 1

**API Kind** | **Standard Library** | **Foundation**
---|:---:|:---:
`init` | 41 | 18
`func` | 42 | 55
`subscript` | 9 | 0
`var` | 26 | 14

**Total: 205 APIs**

By contrast, `Int` has 80 APIs, none with more than two parameters.[0] String
processing is complex enough; users shouldn't have to wade through API sprawl
just to get started.

Many of the choices detailed below contribute to solving this problem,
including:

* Restoring `Collection` conformance and dropping the `.characters` view.
* Providing a more general, composable slicing syntax.
* Altering `Comparable` so that parameterized
   (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
* Clearly separating language-dependent operations on text produced
   by and for humans from language-independent
   operations on text produced by and for machine processing.
* Relocating APIs that fall outside the domain of basic string processing and
   discouraging the proliferation of ad-hoc extensions.

### Batteries Included

While `String` is available to all programs out-of-the-box, crucial APIs for
basic string processing tasks are still inaccessible until `Foundation` is
imported. While it makes sense that `Foundation` is needed for domain-specific
jobs such as
[linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
one should not need to import anything to, for example, do case-insensitive
comparison.

### Unicode Compliance and Platform Support

The Unicode standard provides a crucial objective reference point for what
constitutes correct behavior in an extremely complex domain, so
Unicode-correctness is, and will remain, a fundamental design principle behind
Swift's `String`. That said, the Unicode standard is an evolving document, so
this objective reference-point is not fixed.[1] While
many of the most important operations—e.g. string hashing, equality, and
non-localized comparison—will be stable, the semantics
of others, such as grapheme breaking and localized comparison and case
conversion, are expected to change as platforms are updated, so programs should
be written so their correctness does not depend on precise stability of these
semantics across OS versions or platforms. Although it may be possible to
imagine static and/or dynamic analysis tools that will help users find such
errors, the only sure way to deal with this fact of life is to educate users.

## Design Points

### Internationalization

There is strong evidence that developers cannot determine how to use
internationalization APIs correctly. Although documentation could and should be
improved, the sheer size, complexity, and diversity of these APIs is a major
contributor to the problem, causing novices to tune out, and more experienced
programmers to make avoidable mistakes.

The first step in improving this situation is to regularize all localized
operations as invocations of normal string operations with extra
parameters. Among other things, this means:

1. Doing away with `localizedXXX` methods
2. Providing a terse way to name the current locale as a parameter
3. Automatically adjusting defaults for options such
  as case sensitivity based on whether the operation is localized.
4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
   guidance in the
   [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).

Along with appropriate documentation updates, these changes will make localized
operations more teachable, comprehensible, and approachable, thereby lowering a
barrier that currently leads some developers to ignore localization issues
altogether.

#### The Default Behavior of `String`

Although this isn't well-known, the most accessible forms of many operations on
Swift `String` (and `NSString`) are really only appropriate for text that is
intended to be processed for, and consumed by, machines. The semantics of the
operations with the simplest spellings are always non-localized and
language-agnostic.

Two major factors play into this design choice:

1. Machine processing of text is important, so we should have first-class,
  accessible functions appropriate to that use case.

2. The most general localized operations require a locale parameter not required
  by their un-localized counterparts. This naturally skews complexity towards
  localized operations.

Reaffirming that `String`'s simplest APIs have
language-independent/machine-processed semantics has the benefit of clarifying
the proper default behavior of operations such as comparison, and allows us to
make [significant optimizations](#collation-semantics) that were previously
thought to conflict with Unicode.

#### Future Directions

One of the most common internationalization errors is the unintentional
presentation to users of text that has not been localized, but regularizing APIs
and improving documentation can go only so far in preventing this error.
Combined with the fact that `String` operations are non-localized by default,
the environment for processing human-readable text may still be somewhat
error-prone in Swift 4.

For an audience of mostly non-experts, it is especially important that naïve
code is very likely to be correct if it compiles, and that more sophisticated
issues can be revealed progressively. For this reason, we intend to
specifically and separately target localization and internationalization
problems in the Swift 5 timeframe.

### Operations With Options

There are three categories of common string operation that commonly need to be
tuned in various dimensions:

**Operation**|**Applicable Options**
---|---
sort ordering | locale, case/diacritic/width-insensitivity
case conversion | locale
pattern matching | locale, case/diacritic/width-insensitivity

The defaults for case-, diacritic-, and width-insensitivity are different for
localized operations than for non-localized operations, so for example a
localized sort should be case-insensitive by default, and a non-localized sort
should be case-sensitive by default. We propose a standard “language” of
defaulted parameters to be used for these purposes, with usage roughly like this:

    x.compared(to: y, case: .sensitive, in: swissGerman)

    x.lowercased(in: .currentLocale)

    x.allMatches(
        somePattern, case: .insensitive, diacritic: .insensitive)

This usage might be supported by code like this:

    enum StringSensitivity {
        case sensitive
        case insensitive
    }

    extension Locale {
        static var currentLocale: Locale { ... }
    }

    extension Unicode {
        // An example of the option language in declaration context,
        // with nil defaults indicating unspecified, so defaults can be
        // driven by the presence/absence of a specific Locale
        func frobnicated(
            case caseSensitivity: StringSensitivity? = nil,
            diacritic diacriticSensitivity: StringSensitivity? = nil,
            width widthSensitivity: StringSensitivity? = nil,
            in locale: Locale? = nil
        ) -> Self { ... }
    }

### Comparing and Hashing Strings

#### Collation Semantics

What Unicode says about collation—which is used in `<`, `==`, and hashing—turns
out to be quite interesting, once you pick it apart. The full Unicode Collation
Algorithm (UCA) works like this:

1. Fully normalize both strings
2. Convert each string to a sequence of numeric triples to form a collation key
3. “Flatten” the key by concatenating the sequence of first elements to the
  sequence of second elements to the sequence of third elements
4. Lexicographically compare the flattened keys
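To make the flattening concrete, here is a toy sketch of steps 2–4 with
made-up single-scalar weights (the real UCA maps *sequences* of scalars using
the Default Unicode Collation Element Table; `toyTable` is entirely
hypothetical):

```swift
// Toy weights: (primary, secondary, tertiary). Real collation data
// comes from the Default Unicode Collation Element Table.
typealias CollationElement = (primary: UInt16, secondary: UInt16, tertiary: UInt16)

let toyTable: [Character: CollationElement] = [
    "a": (1, 0, 0), "A": (1, 0, 1), "á": (1, 1, 0), "b": (2, 0, 0),
]

// Steps 2–3: map to triples, then concatenate all primaries,
// then all secondaries, then all tertiaries.
func flattenedKey(_ s: String) -> [UInt16] {
    let elements = s.compactMap { toyTable[$0] }
    return elements.map { $0.primary }
         + elements.map { $0.secondary }
         + elements.map { $0.tertiary }
}

// Step 4: lexicographic comparison of the flattened keys.
func collates(_ lhs: String, before rhs: String) -> Bool {
    return flattenedKey(lhs).lexicographicallyPrecedes(flattenedKey(rhs))
}
```

Note how flattening makes any primary difference outweigh all secondary
differences: with this table `"áa"` collates before `"ab"` even though `"á"`
carries a secondary weight.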

While step 1 can usually
be [done quickly](http://unicode.org/reports/tr15/#Description_Norm) and
incrementally, step 2 uses a collation table that maps matching *sequences* of
unicode scalars in the normalized string to *sequences* of triples, which get
accumulated into a collation key. Predictably, this is where the real costs
lie.

*However*, there are some bright spots to this story. First, as it turns out,
string sorting (localized or not) should be done down to what's called
the
[“identical” level](http://unicode.org/reports/tr10/#Multi_Level_Comparison),
which adds a step 3a: append the string's normalized form to the flattened
collation key. At first blush this just adds work, but consider what it does
for equality: two strings that normalize the same, naturally, will collate the
same. But also, *strings that normalize differently will always collate
differently*. In other words, for equality, it is sufficient to compare the
strings' normalized forms and see if they are the same. We can therefore
entirely skip the expensive part of collation for equality comparison.

Next, naturally, anything that applies to equality also applies to hashing: it
is sufficient to hash the string's normalized form, bypassing collation keys.
This should provide significant speedups over the current implementation.
Perhaps more importantly, since comparison down to the “identical” level applies
even to localized strings, it means that hashing and equality can be implemented
exactly the same way for localized and non-localized text, and hash tables with
localized keys will remain valid across current-locale changes.
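This property is directly observable in today's Swift, where `==` and hashing
are defined in terms of canonical equivalence rather than collation keys:

```swift
let precomposed = "caf\u{E9}"   // "café" using U+00E9
let decomposed  = "cafe\u{301}" // "café" using e + U+0301 COMBINING ACUTE

// Canonically equivalent strings compare equal and hash identically,
// with no collation key ever being built.
assert(precomposed == decomposed)
assert(precomposed.hashValue == decomposed.hashValue)
```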

Finally, once it is agreed that the *default* role for `String` is to handle
machine-generated and machine-readable text, the default ordering of `String`s
need no longer use the UCA at all. It is sufficient to order them in any way
that's consistent with equality, so `String` ordering can simply be a
lexicographical comparison of normalized forms[4]
(which is equivalent to lexicographically comparing the sequences of grapheme
clusters), again bypassing step 2 and offering another speedup.

This leaves us executing the full UCA *only* for localized sorting, and ICU's
implementation has apparently been very well optimized.

Following this scheme everywhere would also allow us to make sorting behavior
consistent across platforms. Currently, we sort `String` according to the UCA,
except that—*only on Apple platforms*—pairs of ASCII characters are ordered by
unicode scalar value.

#### Syntax

Because the current `Comparable` protocol expresses all comparisons with binary
operators, string comparisons—which may require
additional [options](#operations-with-options)—do not fit smoothly into the
existing syntax. At the same time, we'd like to solve other problems with
comparison, as outlined
in
[this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
(implemented by changes at the head
of
[this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
We should adopt a modification of that proposal that uses a method rather than
an operator `<=>`:

    enum SortOrder { case before, same, after }

    protocol Comparable : Equatable {
        func compared(to: Self) -> SortOrder
        ...
    }

This change will give us a syntactic platform on which to implement methods with
additional, defaulted arguments, thereby unifying and regularizing comparison
across the library.

    extension String {
        func compared(to: Self) -> SortOrder
    }
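As a minimal sketch of the payoff (with `Name` as an illustrative stand-in,
not the proposed standard library design), the binary operators can then be
derived once from the single `compared(to:)` primitive:

```swift
enum SortOrder { case before, same, after }

// `Name` is an illustrative stand-in, not the proposed stdlib design.
struct Name {
    var value: String

    // The single primitive comparison operation...
    func compared(to other: Name) -> SortOrder {
        if value < other.value { return .before }
        if value > other.value { return .after }
        return .same
    }
}

// ...from which the binary operators are each defined exactly once.
func < (lhs: Name, rhs: Name) -> Bool { return lhs.compared(to: rhs) == .before }
func == (lhs: Name, rhs: Name) -> Bool { return lhs.compared(to: rhs) == .same }
```

Defaulted arguments (case, diacritic, locale, and so on) then attach naturally
to the method, which operator syntax cannot accommodate.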

**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible
that the standard library simply adopts Foundation's `ComparisonResult` as is,
but we believe the community should at least consider alternate naming before
that happens. There will be an opportunity to discuss the choices in detail
when the modified
[Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes
up for review.

### `String` should be a `Collection` of `Character`s Again

In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
convinced ourselves that its semantics differed from those of `Collection` too
significantly.

It was always well understood that if strings were treated as sequences of
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`,
and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was
a collection of `Character` (extended grapheme clusters). During 2.0
development, though, we realized that correct string concatenation could
occasionally merge distinct grapheme clusters at the start and end of combined
strings.

This quirk aside, every aspect of strings-as-collections-of-graphemes appears to
comport perfectly with Unicode. We think the concatenation problem is tolerable,
because the cases where it occurs all represent partially-formed constructs. The
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE
ACCENT)—are explicitly called out in the Unicode standard as
“[degenerate](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)” or
“[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The other
cases—such as a string ending in a zero-width joiner or half of a regional
indicator—appear to be equally transient and unlikely outside of a text editor.

Admitting these cases encourages exploration of grapheme composition and is
consistent with what appears to be an overall Unicode philosophy that “no
special provisions are made to get marginally better behavior for… cases that
never occur in practice.”[2] Furthermore, it seems
unlikely to disturb the semantics of any plausible algorithms. We can handle
these cases by documenting them, explicitly stating that the elements of a
`String` are an emergent property based on Unicode rules.

The benefits of restoring `Collection` conformance are substantial:

* Collection-like operations encourage experimentation with strings to
   investigate and understand their behavior. This is useful for teaching new
   programmers, but also good for experienced programmers who want to
   understand more about strings/unicode.

* Extended grapheme clusters form a natural element boundary for Unicode
   strings. For example, searching and matching operations will always produce
   results that line up on grapheme cluster boundaries.

* Character-by-character processing is a legitimate thing to do in many real
   use-cases, including parsing, pattern matching, and language-specific
   transformations such as transliteration.

* `Collection` conformance makes a wide variety of powerful operations
   available that are appropriate to `String`'s default role as the vehicle for
   machine processed text.

   The methods `String` would inherit from `Collection`, where they parallel
   higher-level string algorithms, have the right semantics. For example,
   grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of
   `flatMap` with case-conversion, produce the same results one would expect
   from whole-string ordering comparison, equality comparison, and
   case-conversion, respectively. `reverse` operates correctly on graphemes,
   keeping diacritics moored to their base characters and leaving emoji intact.
   Other methods such as `indexOf` and `contains` make obvious sense. A few
   `Collection` methods, like `min` and `max`, may not be particularly useful
   on `String`, but we don't consider that to be a problem worth solving, in
   the same way that we wouldn't try to suppress `min` and `max` on a
   `Set([UInt8])` that was used to store IP addresses.

* Many of the higher-level operations that we want to provide for `String`s,
   such as parsing and pattern matching, should apply to any `Collection`, and
   many of the benefits we want for `Collections`, such
   as unified slicing, should accrue
   equally to `String`. Making `String` part of the same protocol hierarchy
   allows us to write these operations once and not worry about keeping the
   benefits in sync.

* Slicing strings into substrings is a crucial part of the vocabulary of
   string processing, and all other sliceable things are `Collection`s.
   Because of its collection-like behavior, users naturally think of `String`
   in collection terms, but run into frustrating limitations where it fails to
   conform and are left to wonder where all the differences lie. Many simply
   “correct” this limitation by declaring a trivial conformance:

      extension String : BidirectionalCollection {}

   Even if we removed indexing-by-element from `String`, users could still do
   this:

     extension String : BidirectionalCollection {
       subscript(i: Index) -> Character { return characters[i] }
     }

   It would be much better to legitimize the conformance to `Collection` and
   simply document the oddity of any concatenation corner-cases, than to deny
   users the benefits on the grounds that a few cases are confusing.
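Today's Swift, which restored this conformance in Swift 4, shows the
grapheme-level behavior described above:

```swift
let family = "👨‍👩‍👧‍👦"            // four scalars joined by ZWJs
let cafe = "cafe\u{301}"        // "café" with a combining accent

assert(family.count == 1)       // one Character (grapheme cluster)
assert(cafe.count == 4)         // the accent stays with its base
assert(String(cafe.reversed()) == "e\u{301}fac") // diacritics stay moored
```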

Note that the fact that `String` is a collection of graphemes does *not* mean
that string operations will necessarily have to do grapheme boundary
recognition. See the Unicode protocol section for details.

### `Character` and `CharacterSet`

`Character`, which represents a
Unicode
[extended grapheme cluster](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries),
is a bit of a black box, requiring conversion to `String` in order to
do any introspection, including interoperation with ASCII. To fix this, we should:

- Add a `unicodeScalars` view much like `String`'s, so that the sub-structure
  of grapheme clusters is discoverable.
- Add a failable `init` from sequences of scalars (returning nil for sequences
  that contain 0 or 2+ graphemes).
- (Lower priority) expose some operations, such as `func uppercase() ->
  String`, `var isASCII: Bool`, and, to the extent they can be sensibly
  generalized, queries of unicode properties that should also be exposed on
  `UnicodeScalar` such as `isAlphabetic` and `isGraphemeBase`.
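Swift later gained APIs along these lines; a short demonstration using the
`Character` introspection that shipped in Swift 4.2 and 5:

```swift
let ch: Character = "e\u{301}"  // "é" as base letter + combining accent

// The scalar sub-structure of the grapheme cluster is discoverable.
assert(ch.unicodeScalars.count == 2)

// Introspection without round-tripping through String.
assert(!ch.isASCII)
assert(Character("a").isASCII)
assert(ch.uppercased() == "É")  // equal under canonical equivalence
```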

Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar`
type. This means it is usable on `String`, but only by going through the unicode
scalar view. To deal with this clash in the short term, `CharacterSet` should be
renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to
introduce a `CharacterSet` that provides similar functionality for extended
grapheme clusters.[5]

### Unification of Slicing Operations

Creating substrings is a basic part of String processing, but the slicing
operations that we have in Swift are inconsistent in both their spelling and
their naming:

* Slices with two explicit endpoints are done with subscript, and support
   in-place mutation:

       s[i..<j].mutate()

* Slicing from an index to the end, or from the start to an index, is done
   with a method and does not support in-place mutation:

       s.prefix(upTo: i).readOnly()

Prefix and suffix operations should be migrated to be subscripting operations
with one-sided ranges i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as
in
[this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
With generic subscripting in the language, that will allow us to collapse a wide
variety of methods and subscript overloads into a single implementation, and
give users an easy-to-use and composable way to describe subranges.
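One-sided ranges were subsequently accepted (SE-0172), so the migration looks
like this in practice:

```swift
let s = "Hello, Swift"
let comma = s.firstIndex(of: ",")!

// One-sided range subscripts replace the prefix/suffix methods:
assert(s[..<comma] == "Hello")                        // was s.prefix(upTo: comma)
assert(s[s.index(comma, offsetBy: 2)...] == "Swift")  // was s.suffix(from: ...)
```

Both forms return slices of the original storage and support the same
composition as two-endpoint subscripts.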

Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 5)`
is an ongoing research project that can be considered part of the potential
long-term vision of text (and collection) processing.

### Substrings

When implementing substring slicing, languages are faced with three options:

1. Make the substrings the same type as string, and share storage.
2. Make the substrings the same type as string, and copy storage when making the substring.
3. Make substrings a different type, with a storage copy on conversion to string.

We think number 3 is the best choice. A walk-through of the tradeoffs follows.

#### Same type, shared storage

In Swift 3.0, slicing a `String` produces a new `String` that is a view into a
subrange of the original `String`'s storage. This is why `String` is 3 words in
size (the start, length and buffer owner), unlike the similar `Array` type
which is only one.

This is a simple model with big efficiency gains when chopping up strings into
multiple smaller strings. But it does mean that a stored substring keeps the
entire original string buffer alive even after it would normally have been
released.

This arrangement has proven to be problematic in other programming languages,
because applications sometimes extract small strings from large ones and keep
those small strings long-term. That is considered a memory leak, and it was
enough of a problem in Java that, in version 1.7, substrings were changed from
sharing storage to making a copy.

#### Same type, copied storage

Copying of substrings is also the choice made in C#, and in the default
`NSString` implementation. This approach avoids the memory leak issue, but has
obvious performance overhead in performing the copies.

This in turn encourages trafficking in string/range pairs instead of in
substrings, for performance reasons, leading to API challenges. For example:

    foo.compare(bar, range: start..<end)

Here, it is not clear whether `range` applies to `foo` or `bar`. This
relationship is better expressed in Swift as a slicing operation:

    foo[start..<end].compare(bar)

Not only does this clarify to which string the range applies, it also brings
this sub-range capability to any API that operates on `String` "for free". So
these other combinations also work equally well:

    // apply range on argument rather than target
    foo.compare(bar[start..<end])
    // apply range on both
    foo[start..<end].compare(bar[start1..<end1])
    // compare two strings ignoring first character
    foo.dropFirst().compare(bar.dropFirst())

In all three cases, an explicit range argument need not appear on the `compare`
method itself. The implementation of `compare` does not need to know anything
about ranges. Methods need only take range arguments when that was an
integral part of their purpose (for example, setting the start and end of a
user's current selection in a text box).

#### Different type, shared storage

The desire to share underlying storage while preventing accidental memory leaks
occurs with slices of `Array`. For this reason we have an `ArraySlice` type.
The inconvenience of a separate type is mitigated by most operations used on
`Array` from the standard library being generic over `Sequence` or `Collection`.

We should apply the same approach for `String` by introducing a distinct
`SubSequence` type, `Substring`. Similar advice given for `ArraySlice` would apply to `Substring`:

> **Important:** Long-term storage of `Substring` instances is discouraged. A
> substring holds a reference to the entire storage of a larger string, not
> just to the portion it presents, even after the original string's lifetime
> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
> of large strings that are no longer otherwise accessible, which can appear
> to be memory leakage.

When assigning a `Substring` to a longer-lived variable (usually a stored
property) explicitly of type `String`, a type conversion will be performed, and
at this point the substring buffer is copied and the original string's storage
can be released.
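A short sketch of that lifetime story as it shipped (note that conversion is
spelled explicitly via `String.init`; the implicit subtype conversion
discussed in this section was not ultimately adopted):

```swift
let bigString = String(repeating: "x", count: 1_000) + ",tail"
let comma = bigString.firstIndex(of: ",")!

// The Substring is a view sharing bigString's storage.
let tail: Substring = bigString[bigString.index(after: comma)...]

// Converting to String copies only the presented portion, after which
// the large buffer could be released if nothing else referenced it.
let stored: String = String(tail)
assert(stored == "tail")
```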

A `String` that was not its own `Substring` could be one word—a single tagged
pointer—without requiring additional allocations. A `Substring` would be a view
onto a `String`, so it is 3 words: a pointer to the owner, a pointer to the
start, and a length. The small string optimization for `Substring` would take
advantage of the larger size, probably with a less compressed encoding for
speed.

The downside of having two types is the inconvenience of sometimes having a
`Substring` when you need a `String`, and vice-versa. It is likely this would
be a significantly bigger problem than with `Array` and `ArraySlice`, as
slicing of `String` is such a common operation. It is especially relevant to
existing code that assumes `String` is the currency type. To ease the pain of
type mismatches, `Substring` should be a subtype of `String` in the same way
that `Int` is a subtype of `Optional<Int>`. This would give users an implicit
conversion from `Substring` to `String`, as well as the usual implicit
conversions such as `[Substring]` to `[String]` that other subtype
relationships receive.

In most cases, type inference combined with the subtype relationship should
make the type difference a non-issue and users will not care which type they
are using. For flexibility and optimizability, most operations from the
standard library will traffic in generic models of
[`Unicode`](#the--code-unicode--code--protocol).

##### Guidance for API Designers

In this model, **if a user is unsure about which type to use, `String` is always
a reasonable default**. A `Substring` passed where `String` is expected will be
implicitly copied. When compared to the “same type, copied storage” model, we
have effectively deferred the cost of copying from the point where a substring
is created until it must be converted to `String` for use with an API.

A user who needs to optimize away copies altogether should use this guideline:
if for performance reasons you are tempted to add a `Range` argument to your
method as well as a `String` to avoid unnecessary copies, you should instead
use `Substring`.
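Following that guideline, a sketch of a `Substring`-taking API (the function
name is illustrative):

```swift
// Takes Substring so callers can pass any slice without copying.
func leadingSpaceCount(_ text: Substring) -> Int {
    return text.prefix(while: { $0 == " " }).count
}

let line = "    indented"
assert(leadingSpaceCount(line[...]) == 4)          // whole string as a slice
assert(leadingSpaceCount(line.dropFirst(2)) == 2)  // any slice works, copy-free
```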

##### The “Empty Subscript”

To make it easy to call such an optimized API when you only have a `String` (or
to call any API that takes a `Collection`'s `SubSequence` when all you have is
the `Collection`), we propose the following “empty subscript” operation,

    extension Collection {
        subscript() -> SubSequence {
            return self[startIndex..<endIndex]
        }
    }

which allows the following usage:

    funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring

The `[]` syntax can be offered as a fixit when needed, similar to `&` for an
`inout` argument. While it doesn't help a user convert `[String]` to
`[Substring]`, the need for such conversions is extremely rare, and they can be
done with a simple `map` (which could also be offered by a fixit):

    takesAnArrayOfSubstring(arrayOfString.map { $0[] })

#### Other Options Considered

As we have seen, all three options above have downsides, but it's possible
these downsides could be eliminated/mitigated by the compiler. We are proposing
one such mitigation—implicit conversion—as part of the "different type,
shared storage" option, to help avoid the cognitive load on developers of
having to deal with a separate `Substring` type.

To avoid the memory leak issues of a "same type, shared storage" substring
option, we considered whether the compiler could perform an implicit copy of
the underlying storage when it detects the string is being "stored" for long
term usage, say when it is assigned to a stored property. The trouble with this
approach is it is very difficult for the compiler to distinguish between
long-term storage versus short-term in the case of abstractions that rely on
stored properties. For example, should the storing of a substring inside an
`Optional` be considered long-term? Or the storing of multiple substrings
inside an array? The latter would not work well in the case of a
`components(separatedBy:)` implementation that intended to return an array of
substrings. It would also be difficult to distinguish intentional medium-term
storage of substrings, say by a lexer. There does not appear to be an effective
consistent rule that could be applied in the general case for detecting when a
substring is truly being stored long-term.

To avoid the cost of copying substrings under "same type, copied storage", the
optimizer could be enhanced to reduce the impact of some of those copies.
For example, this code could be optimized to pull the invariant substring out
of the loop:

    for _ in 0..<lots {
        someFunc(takingString: bigString[bigRange])
    }

It's worth noting that a similar optimization is needed to avoid an equivalent
problem with implicit conversion in the "different type, shared storage" case:

    let substring = bigString[bigRange]
    for _ in 0..<lots { someFunc(takingString: substring) }

However, in the case of "same type, copied storage" there are many use cases
that cannot be optimized as easily. Consider the following simple definition of
a recursive `contains` algorithm, which when substring slicing is linear makes
the overall algorithm quadratic:

    extension String {
        func containsChar(_ x: Character) -> Bool {
            return !isEmpty && (first == x || dropFirst().containsChar(x))
        }
    }

For the optimizer to eliminate this problem is unrealistic, forcing the user to
remember to optimize the code to not use string slicing if they want it to be
efficient (assuming they remember):

    extension String {
        // add optional argument tracking progress through the string
        func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
            let idx = idx ?? startIndex
            return idx != endIndex
                && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
        }
    }
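Under the "different type, shared storage" model this pressure disappears:
`dropFirst()` on a shared-storage `Substring` is an O(1) re-slice rather than
a copy, so the naive recursion stays linear. A sketch using `Substring` as it
shipped:

```swift
// The same naive recursion, but on Substring: each dropFirst() is a
// cheap re-slice of the shared storage, not a copy.
extension Substring {
    func containsChar(_ x: Character) -> Bool {
        return !isEmpty && (first == x || dropFirst().containsChar(x))
    }
}

assert("haystack"[...].containsChar("y"))
assert(!"haystack"[...].containsChar("z"))
```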

#### Substrings, Ranges and Objective-C Interop

The pattern of passing a string/range pair is common in several Objective-C
APIs, and is made especially awkward in Swift by the non-interchangeability of
`Range<String.Index>` and `NSRange`.

    s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))

In general, however, the Swift idiom for operating on a sub-range of a
`Collection` is to *slice* the collection and operate on that:

    s2.find(s2[j..<s2.endIndex])

Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported
without the `NSRange` argument. The Objective-C importer should be changed to
give these APIs special treatment so that when a `Substring` is passed, instead
of being converted to a `String`, the full `NSString` and range are passed to
the Objective-C method, thereby avoiding a copy.

As a result, you would never need to pass an `NSRange` to these APIs, which
solves the impedance problem by eliminating the argument, resulting in more
idiomatic Swift code while retaining the performance benefit. To help users
manually handle any cases that remain, Foundation should be augmented to allow
the following syntax for converting to and from `NSRange`:

    let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
    let iToJ = Range(nsr, in: s)    // Equivalent to i..<j

### The `Unicode` protocol

With `Substring` and `String` being distinct types and sharing almost all
interface and semantics, and with the highest-performance string processing
requiring knowledge of encoding and layout that the currency types can't
provide, it becomes important to capture the common “string API” in a protocol.
Since Unicode conformance is a key feature of string processing in Swift, we
call that protocol `Unicode`:

**Note:** The following assumes several features that are planned but not yet implemented in
Swift, and should be considered a sketch rather than a final design.

    protocol Unicode
        : Comparable, BidirectionalCollection where Element == Character {

        associatedtype Encoding : UnicodeEncoding
        var encoding: Encoding { get }

        associatedtype CodeUnits
            : RandomAccessCollection where Element == Encoding.CodeUnit
        var codeUnits: CodeUnits { get }

        associatedtype UnicodeScalars
            : BidirectionalCollection where Element == UnicodeScalar
        var unicodeScalars: UnicodeScalars { get }

        associatedtype ExtendedASCII
            : BidirectionalCollection where Element == UInt32
        var extendedASCII: ExtendedASCII { get }
    }

    extension Unicode {
        // ... define high-level non-mutating string operations, e.g. search ...

        func compared<Other: Unicode>(
            to rhs: Other,
            case caseSensitivity: StringSensitivity? = nil,
            diacritic diacriticSensitivity: StringSensitivity? = nil,
            width widthSensitivity: StringSensitivity? = nil,
            in locale: Locale? = nil
        ) -> SortOrder { ... }
    }

    extension Unicode : RangeReplaceableCollection where CodeUnits :
        RangeReplaceableCollection {

        // Satisfy protocol requirement
        mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
            where C.Element == Element

        // ... define high-level mutating string operations, e.g. replace ...
    }

The goal is that `Unicode` exposes the underlying encoding and code units in
such a way that for types with a known representation (e.g. a high-performance
`UTF8String`) that information can be known at compile-time and can be used to
generate a single path, while still allowing types like `String` that admit
multiple representations to use runtime queries and branches to fast path
specializations.

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.

### Scanning, Matching, and Tokenization

#### Low-Level Textual Analysis

We should provide convenient APIs for processing strings by character. For example,
it should be easy to cleanly express, “if this string starts with `"f"`, process
the rest of the string as follows…” Swift is well-suited to expressing this
common pattern beautifully, but we need to add the APIs. Here are two examples
of the sort of code that might be possible given such APIs:

    if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
        somethingWith(input) // process the rest of input
    }

    if let (number, restOfInput) = input.parsingPrefix(Int.self) {
        ...
    }

The specific spelling and functionality of APIs like this are TBD. The larger
point is to make sure matching-and-consuming jobs are well-supported.

#### Unified Pattern Matcher Protocol

Many of the current methods that do matching are overloaded to do the same
logical operations in different ways, with the following axes:

- Logical Operation: `find`, `split`, `replace`, match at start
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
the method name, and sometimes an argument
- Whole string or subrange.

We should represent these aspects as orthogonal, composable components,
abstracting pattern matchers into a protocol like
[this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
that can allow us to define logical operations once, without introducing
overloads, and massively reducing API surface area.

For example, using the strawman prefix `%` syntax to turn string literals into
patterns, the following pairs would all invoke the same generic methods:

    if let found = s.firstMatch(%"searchString") { ... }
    if let found = s.firstMatch(someRegex) { ... }

    for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
    for m in s.allMatches(someRegex) { ... }

    let items = s.split(separatedBy: ", ")
    let tokens = s.split(separatedBy: CharacterSet.whitespace)

Note that, because Swift requires the indices of a slice to match the indices of
the range from which it was sliced, operations like `firstMatch` can return a
`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in
the string being searched, if needed, can easily be recovered as the
`startIndex` and `endIndex` of the `Substring`.
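That index-preservation guarantee already holds for slicing today, and can be
checked directly:

```swift
let s = "abcdef"
let r = s.index(s.startIndex, offsetBy: 2)..<s.index(s.startIndex, offsetBy: 4)
let slice = s[r]  // "cd"

// The slice's indices are positions in the *original* string, so a
// Range can always be recovered from a returned Substring.
assert(slice.startIndex == r.lowerBound)
assert(slice.endIndex == r.upperBound)
assert(s[slice.startIndex..<slice.endIndex] == "cd")
```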

Note also that matching operations are useful for collections in general, and
would fall out of this proposal:

    // replace subsequences of contiguous NaNs with zero
    forces.replace(oneOrMore([Float.nan]), [0.0])

#### Regular Expressions

Addressing regular expressions is out of scope for this proposal.
That said, it is important to note that the pattern matching protocol mentioned
above provides a suitable foundation for regular expressions, and types such as
`NSRegularExpression` can easily be retrofitted to conform to it. In the
future, support for regular expression literals in the compiler could allow for
compile-time syntax checking and optimization.

### String Indices

`String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and
`utf16`—each with its own opaque index type. The APIs used to translate indices
between views add needless complexity, and the opacity of indices makes them
difficult to serialize.

The index translation problem has two aspects:

1. `String` views cannot consume one another's indices without a cumbersome
   conversion step. An index into a `String`'s `characters` must be translated
   before it can be used as a position in its `unicodeScalars`. Although these
   translations are rarely needed, they add conceptual and API complexity.
2. Many APIs in the core libraries and other frameworks still expose `String`
   positions as `Int`s and regions as `NSRange`s, which can only reference a
   `utf16` view and interoperate poorly with `String` itself.

#### Index Interchange Among Views

`String`'s need for flexible backing storage and reasonably-efficient indexing
(i.e. without dynamically allocating and reference-counting the indices
themselves) means indices need an efficient underlying storage type. Although
we do not wish to expose `String`'s indices *as* integers, `Int` offsets into
the underlying code unit storage make a good underlying storage type, provided
`String`'s underlying storage supports random access. We think random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Making these `Int` code unit offsets conveniently accessible and constructible
solves the serialization problem:

    clipboard.write(s.endIndex.codeUnitOffset)
    let offset = clipboard.read(Int.self)
    let i = String.Index(codeUnitOffset: offset)

Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
seamless by having them share an index type (semantics of indexing a `String`
between grapheme cluster boundaries are TBD—it can either trap or be forgiving).
Having a common index allows easy traversal into the interior of graphemes,
something that is often needed, without making it likely that someone will do it
by accident.

- `String.index(after:)` should advance to the next grapheme, even when the
  index points partway through a grapheme.

- `String.index(before:)` should move to the start of the grapheme before
  the current position.
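The distinction matters because a single `Character` can span several unicode scalars, so the views disagree on element counts. With today's views (Swift 3 syntax):

```swift
// A regional-indicator flag is one grapheme cluster built from two scalars.
let flag = "🇺🇸"
print(flag.characters.count)     // 1 — one Character (grapheme cluster)
print(flag.unicodeScalars.count) // 2 — U+1F1FA, U+1F1F8
print(flag.utf16.count)          // 4 — each scalar needs a surrogate pair
```

An index between the two regional-indicator scalars is a valid scalar position but an interior grapheme position, which is exactly the case the `index(after:)`/`index(before:)` rules above must handle.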

Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not
crucial, as the specifics of encoding should not be a concern for most use
cases, and would impose needless costs on the indices of other views. That
said, we can make translation much more straightforward by exposing simple
bidirectional converting `init`s on both index types:

```swift
let u8Position = String.UTF8.Index(someStringIndex)
let originalPosition = String.Index(u8Position)
```

#### Index Interchange with Cocoa

We intend to address `NSRange`s that denote substrings in Cocoa APIs as
described [later in this document](#substrings--ranges-and-objective-c-interop).
That leaves the interchange of bare indices with Cocoa APIs trafficking in
`Int`. Hopefully such APIs will be rare, but when needed, the following
extension, which would be useful for all `Collections`, can help:

```swift
extension Collection {
  func index(offset: IndexDistance) -> Index {
    return index(startIndex, offsetBy: offset)
  }
  func offset(of i: Index) -> IndexDistance {
    return distance(from: startIndex, to: i)
  }
}
```

Then integers can easily be translated into offsets into a `String`'s `utf16`
view for consumption by Cocoa:

```swift
let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
let swiftIndex = s.utf16.index(offset: cocoaIndex)
```

### Formatting

A full treatment of formatting is out of scope of this proposal, but
we believe it's crucial for completing the text processing picture. This
section details some of the existing issues and thinking that may guide future
development.

#### Printf-Style Formatting

`String.format` is designed on the `printf` model: it takes a format string with
textual placeholders for substitution, and an arbitrary list of other arguments.
The syntax and meaning of these placeholders has a long history in
C, but for anyone who doesn't use them regularly they are cryptic and complex,
as the `printf(3)` man page attests.

Aside from complexity, this style of API has two major problems: First, the
spelling of these placeholders must match up to the types of the arguments, in
the right order, or the behavior is undefined. Some limited support for
compile-time checking of this correspondence could be implemented, but only for
the cases where the format string is a literal. Second, there's no reasonable
way to extend the formatting vocabulary to cover the needs of new types: you are
stuck with what's in the box.
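For illustration, here is the pairing the programmer must get right by hand with Foundation's `String(format:)` (values hypothetical); nothing checks it at compile time:

```swift
import Foundation

// Each placeholder must match its argument's type, in order: "%@" for an
// object, "%ld" for an Int. A mismatch compiles but has undefined behavior.
let message = String(format: "%@ scored %ld points", "Alice", 42)
print(message)  // Alice scored 42 points
```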

#### Foundation Formatters

The formatters supplied by Foundation are highly capable and versatile, offering
both formatting and parsing services. When used for formatting, though, the
design pattern demands more from users than it should:

* Matching the type of data being formatted to a formatter type
* Creating an instance of that type
* Setting stateful options (`currency`, `dateStyle`) on the type. Note: the
   need for this step prevents the instance from being used and discarded in
   the same expression where it is created.
* Overall, introduction of needless verbosity into source

These may seem like small issues, but the experience of Apple localization
experts is that the total drag of these factors on programmers is such that they
tend to reach for `String.format` instead.
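In practice the dance looks like this (formatter type and options chosen for illustration):

```swift
import Foundation

// 1. Match the data (a number) to a formatter type (NumberFormatter).
// 2. Create an instance of that type.
let formatter = NumberFormatter()
// 3. Set stateful options on it — which is why the formatter cannot be
//    created and used in the same expression.
formatter.numberStyle = .currency
formatter.locale = Locale(identifier: "en_US")
let price = formatter.string(from: 9.99)!  // "$9.99"
```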

#### String Interpolation

Swift string interpolation provides a user-friendly alternative to printf's
domain-specific language (just write ordinary Swift code!) and its type safety
problems (put the data right where it belongs!) but the following issues prevent
it from being useful for localized formatting (among other jobs):

* [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
   types used in string interpolation.
* [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
   distinguish (fragments of) the base string from the string substitutions.

In the long run, we should improve Swift string interpolation to the point where
it can participate in almost any formatting job. Mostly this centers on
fixing the interpolation protocols per the previous item, and supporting
localization.

To be able to use formatting effectively inside interpolations, it needs to be
both lightweight (because it all happens in-situ) and discoverable. One
approach would be to standardize on `format` methods, e.g.:

```swift
"Column 1: \(n.format(radix:16, width:8)) *** \(message)"

"Something with leading zeroes: \(x.format(fill: zero, width:8))"
```

### C String Interop

Our support for interoperation with nul-terminated C strings is scattered and
incoherent, with six ways to transform a C string into a `String` and four ways
to do the inverse. These APIs should be replaced with the following:

```swift
extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  ///   bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  ///   the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}
```

In both of the construction APIs, any invalid encoding sequence detected will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per Unicode specification. This covers the common case. The
replacement is done *physically* in the underlying storage and the validity of
the result is recorded in the `String`'s `encoding` such that future accesses
need not be slowed down by possible error repair.
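This repair behavior matches what the existing UTF-8 initializer already does today; for example:

```swift
// "Hi" followed by an invalid UTF-8 byte (0xFF) and the terminating NUL.
let bytes: [CChar] = [0x48, 0x69, -1, 0]
let repaired = String(cString: bytes)
// The longest valid prefix of the bad sequence is replaced by U+FFFD:
print(repaired == "Hi\u{FFFD}")  // true
```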

Construction that is aborted when encoding errors are detected can be
accomplished using APIs on the `encoding`. String types that retain their
physical encoding even in the presence of errors and are repaired on-the-fly can
be built as different instances of the `Unicode` protocol.

### Unicode 9 Conformance

Unicode 9 (and MacOS 10.11) brought us support for family emoji, which changes
the process of properly identifying `Character` boundaries. We need to update
`String` to account for this change.

### High-Performance String Processing

Many strings are short enough to store in 64 bits, many can be stored using only
8 bits per unicode scalar, others are best encoded in UTF-16, and some come to
us already in some other encoding, such as UTF-8, that would be costly to
translate. Supporting these formats while maintaining usability for
general-purpose APIs demands that a single `String` type can be backed by many
different representations.

That said, the highest performance code always requires static knowledge of the
data structures on which it operates, and for this code, dynamic selection of
representation comes at too high a cost. Heavy-duty text processing demands a
way to opt out of dynamism and directly use known encodings. Having this
ability can also make it easy to cleanly specialize code that handles dynamic
cases for maximal efficiency on the most common representations.

To address this need, we can build models of the `Unicode` protocol that encode
representation information into the type, such as `NFCNormalizedUTF16String`.

### Parsing ASCII Structure

Although many machine-readable formats support the inclusion of arbitrary
Unicode text, it is also common that their fundamental structure lies entirely
within the ASCII subset (JSON, YAML, many XML formats). These formats are often
processed most efficiently by recognizing ASCII structural elements as ASCII,
and capturing the arbitrary sections between them in more-general strings. The
current String API offers no way to efficiently recognize ASCII and skip past
everything else without the overhead of full decoding into unicode scalars.

For these purposes, strings should supply an `extendedASCII` view that is a
collection of `UInt32`, where values less than `0x80` represent the
corresponding ASCII character, and other values represent data that is specific
to the underlying encoding of the string.
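The proposed `extendedASCII` view does not exist yet, but the intended usage pattern can be approximated today with the `utf8` view, since code units below `0x80` are always genuine ASCII in UTF-8:

```swift
// Find a JSON structural element by byte value, without decoding the
// arbitrary Unicode text around it into unicode scalars.
let json = "{\"name\": \"värde\"}"
let colon = UInt8(ascii: ":")
let structuralCount = json.utf8.filter { $0 == colon }.count  // 1
```

The `extendedASCII` view would generalize this trick to every backing encoding, not just UTF-8.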

## Language Support

This proposal depends on two new features in the Swift language:

1. **Generic subscripts**, to
  enable unified slicing syntax.

2. **A subtype relationship** between
  `Substring` and `String`, enabling framework APIs to traffic solely in
  `String` while still making it possible to avoid copies by handling
  `Substring`s where necessary.

Additionally, **the ability to nest types and protocols inside
protocols** could significantly shrink the footprint of this proposal
on the top-level Swift namespace.

## Open Questions

### Must `String` be limited to storing UTF-16 subset encodings?

- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
question here; this is about what encodings must be storable, without
transcoding, in the common currency type called “`String`”.
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
- If we have a way to get at a `String`'s code units, we need a concrete type in
which to express them in the API of `String`, which is a concrete type.
- If String needs to be able to represent UTF-32, presumably the code units need
to be `UInt32`.
- Not supporting UTF-32-encoded text seems like one reasonable design choice.
- Maybe we can allow UTF-8 storage in `String` and expose its code units as
`UInt16`, just as we would for Latin-1.
- Supporting only UTF-16-subset encodings would imply that `String` indices can
be serialized without recording the `String`'s underlying encoding.

### Do we need a type-erasable base protocol for UnicodeEncoding?

UnicodeEncoding has an associated type, but it may be important to be able to
traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”
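One conventional answer (all names here hypothetical) is to split off a base protocol that has no associated types, so encoding values can be handled fully dynamically:

```swift
// A base protocol without associated types can be used as a dynamic value...
protocol AnyUnicodeEncoding {
    var name: String { get }
}

// ...while the full protocol refines it and carries the associated type.
protocol UnicodeEncodingProtocol: AnyUnicodeEncoding {
    associatedtype CodeUnit
}

struct UTF8Encoding: UnicodeEncodingProtocol {
    typealias CodeUnit = UInt8
    var name: String { return "UTF-8" }
}

// "Tell me the most efficient encoding for this string" can now answer with
// a completely dynamic value:
let dynamic: AnyUnicodeEncoding = UTF8Encoding()
```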

### Should there be a string “facade?”

One possible design alternative makes `Unicode` a vehicle for expressing
the storage and encoding of code units, but does not attempt to give it an API
appropriate for `String`. Instead, string APIs would be provided by a generic
wrapper around an instance of `Unicode`:

```swift
struct StringFacade<U: Unicode> : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias String = StringFacade<StringStorage>
typealias Substring = StringFacade<StringStorage.SubSequence>
```

This design would allow us to de-emphasize lower-level `String` APIs such as
access to the specific encoding, by putting them behind a `.unicode` property.
A similar effect in a facade-less design would require a new top-level
`StringProtocol` playing the role of the facade with an `associatedtype
Storage : Unicode`.

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

```swift
struct String<U: Unicode = StringStorage>
  : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>
```

One advantage of such a design is that naïve users will always extend “the right
type” (`String`) without thinking, and the new APIs will show up on `Substring`,
`MyUTF8String`, etc. That said, it also has downsides that should not be
overlooked, not least of which is the confusability of the meaning of the word
“string.” Is it referring to the generic or the concrete type?

### `TextOutputStream` and `TextOutputStreamable`

`TextOutputStreamable` is intended to provide a vehicle for
efficiently transporting formatted representations to an output stream
without forcing the allocation of storage. Its use of `String`, a
type with multiple representations, as the lowest-level unit of
communication, conflicts with this goal. It might be sufficient to
change `TextOutputStream` and `TextOutputStreamable` to traffic in an
associated type conforming to `Unicode`, but that is not yet clear.
This area will require some design work.

### `description` and `debugDescription`

* Should these be creating localized or non-localized representations?
* Is returning a `String` efficient enough?
* Is `debugDescription` pulling the weight of the API surface area it adds?

### `StaticString`

`StaticString` was added as a byproduct of standard library development and kept
around because it seemed useful, but it was never truly *designed* for client
programmers. We need to decide what happens with it. Presumably *something*
should fill its role, and that should conform to `Unicode`.

## Footnotes

<b id="f0">0</b> The integers rewrite currently underway is expected to
   substantially reduce the scope of `Int`'s API by using more
   generics. [:leftwards_arrow_with_hook:](#a0)

<b id="f1">1</b> In practice, these semantics will usually be tied to the
version of the installed [ICU](http://icu-project.org) library, which
programmatically encodes the most complex rules of the Unicode Standard and its
de-facto extension, CLDR.[:leftwards_arrow_with_hook:](#a1)

<b id="f2">2</b>
See
[http://unicode.org/reports/tr29/#Notation](http://unicode.org/reports/tr29/#Notation). Note
that inserting Unicode scalar values to prevent merging of grapheme clusters would
also constitute a kind of misbehavior (one of the clusters at the boundary would
not be found in the result), so would be relatively costly to implement, with
little benefit. [:leftwards_arrow_with_hook:](#a2)

<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by
the Unicode standard for this purpose. In fact there's
a [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
dedicated to it. In particular, §5.17 says:

> When comparing text that is visible to end users, a correct linguistic sort
> should be used, as described in _Section 5.16, Sorting and
> Searching_. However, in many circumstances the only requirement is for a
> fast, well-defined ordering. In such cases, a binary ordering can be used.

[:leftwards_arrow_with_hook:](#a4)

<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto
properties in a table that's indexed by unicode scalar value. This table is
part of the Unicode standard. Some of these queries (e.g., “is this an
uppercase character?”) may have fairly obvious generalizations to grapheme
clusters, but exactly how to do it is a research topic and *ideally* we'd either
establish the existing practice that the Unicode committee would standardize, or
the Unicode committee would do the research and we'd implement their
result.[:leftwards_arrow_with_hook:](#a5)



(Joe Groff) #10

Jordan points out that the generalized slicing syntax stomps on '...x' and 'x...', which would be somewhat obvious candidates for variadic splatting if that ever becomes a thing. Now, variadics are a much more esoteric feature and slicing is much more important to day-to-day programming, so this isn't the end of the world IMO, but it is something we'd be giving up.

-Joe


(Matthew Johnson) #11

This looks really great to me. I am not an expert in this area so I don’t have a lot of detailed comments. That said, it looks like it will significantly improve the string handling experience of app developers, including better bridging to the APIs we work with every day.

I did notice one particularly interesting thing in the sketch of the Unicode protocol. This section specifically calls out that it relies on features that are “planned but not yet implemented”. I was surprised to see this:

```swift
extension Unicode : RangeReplaceableCollection where CodeUnits :
  RangeReplaceableCollection
```

Conformances via protocol extensions are listed as “unlikely” in the generics manifesto. Has something changed such that this is now a “planned” feature (or at least less “unlikely”)?

···

On Jan 19, 2017, at 8:56 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

Hi all,

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

We’re eager to hear everyone’s thoughts.

Regards,
Ben and Dave

# String Processing For Swift 4

* Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](https://github.com/airspeedswift)

The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus
far, with just this short blurb in the
[list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):

**String re-evaluation**: String is one of the most important fundamental
types in the language. The standard library leads have numerous ideas of how
to improve the programming model for it, without jeopardizing the goals of
providing a unicode-correct-by-default model. Our goal is to be better at
string processing than Perl!

For Swift 4 and beyond we want to improve three dimensions of text processing:

1. Ergonomics
2. Correctness
3. Performance

This document is meant to both provide a sense of the long-term vision
(including undecided issues and possible approaches), and to define the scope of
work that could be done in the Swift 4 timeframe.

## General Principles

### Ergonomics

It's worth noting that ergonomics and correctness are mutually-reinforcing. An
API that is easy to use—but incorrectly—cannot be considered an ergonomic
success. Conversely, an API that's simply hard to use is also hard to use
correctly. Acheiving optimal performance without compromising ergonomics or
correctness is a greater challenge.

Consistency with the Swift language and idioms is also important for
ergonomics. There are several places both in the standard library and in the
foundation additions to `String` where patterns and practices found elsewhere
could be applied to improve usability and familiarity.

### API Surface Area

Primary data types such as `String` should have APIs that are easily understood
given a signature and a one-line summary. Today, `String` fails that test. As
you can see, the Standard Library and Foundation both contribute significantly to
its overall complexity.

**Method Arity** | **Standard Library** | **Foundation**
---|:---:|:---:
0: `ƒ()` | 5 | 7
1: `ƒ(:)` | 19 | 48
2: `ƒ(::)` | 13 | 19
3: `ƒ(:::)` | 5 | 11
4: `ƒ(::::)` | 1 | 7
5: `ƒ(:::::)` | - | 2
6: `ƒ(::::::)` | - | 1

**API Kind** | **Standard Library** | **Foundation**
---|:---:|:---:
`init` | 41 | 18
`func` | 42 | 55
`subscript` | 9 | 0
`var` | 26 | 14

**Total: 205 APIs**

By contrast, `Int` has 80 APIs, none with more than two parameters.[0] String processing is complex enough; users shouldn't have
to press through physical API sprawl just to get started.

Many of the choices detailed below contribute to solving this problem,
including:

* Restoring `Collection` conformance and dropping the `.characters` view.
* Providing a more general, composable slicing syntax.
* Altering `Comparable` so that parameterized
   (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
* Clearly separating language-dependent operations on text produced
   by and for humans from language-independent
   operations on text produced by and for machine processing.
* Relocating APIs that fall outside the domain of basic string processing and
   discouraging the proliferation of ad-hoc extensions.

### Batteries Included

While `String` is available to all programs out-of-the-box, crucial APIs for
basic string processing tasks are still inaccessible until `Foundation` is
imported. While it makes sense that `Foundation` is needed for domain-specific
jobs such as
[linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
one should not need to import anything to, for example, do case-insensitive
comparison.

### Unicode Compliance and Platform Support

The Unicode standard provides a crucial objective reference point for what
constitutes correct behavior in an extremely complex domain, so
Unicode-correctness is, and will remain, a fundamental design principle behind
Swift's `String`. That said, the Unicode standard is an evolving document, so
this objective reference-point is not fixed.[1] While
many of the most important operations—e.g. string hashing, equality, and
non-localized comparison—will be stable, the semantics
of others, such as grapheme breaking and localized comparison and case
conversion, are expected to change as platforms are updated, so programs should
be written so their correctness does not depend on precise stability of these
semantics across OS versions or platforms. Although it may be possible to
imagine static and/or dynamic analysis tools that will help users find such
errors, the only sure way to deal with this fact of life is to educate users.

## Design Points

### Internationalization

There is strong evidence that developers cannot determine how to use
internationalization APIs correctly. Although documentation could and should be
improved, the sheer size, complexity, and diversity of these APIs is a major
contributor to the problem, causing novices to tune out, and more experienced
programmers to make avoidable mistakes.

The first step in improving this situation is to regularize all localized
operations as invocations of normal string operations with extra
parameters. Among other things, this means:

1. Doing away with `localizedXXX` methods
2. Providing a terse way to name the current locale as a parameter
3. Automatically adjusting defaults for options such
  as case sensitivity based on whether the operation is localized.
4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
   guidance in the
   [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html).

Along with appropriate documentation updates, these changes will make localized
operations more teachable, comprehensible, and approachable, thereby lowering a
barrier that currently leads some developers to ignore localization issues
altogether.

#### The Default Behavior of `String`

Although this isn't well-known, the most accessible form of many operations on
Swift `String` (and `NSString`) are really only appropriate for text that is
intended to be processed for, and consumed by, machines. The semantics of the
operations with the simplest spellings are always non-localized and
language-agnostic.

Two major factors play into this design choice:

1. Machine processing of text is important, so we should have first-class,
  accessible functions appropriate to that use case.

2. The most general localized operations require a locale parameter not required
  by their un-localized counterparts. This naturally skews complexity towards
  localized operations.

Reaffirming that `String`'s simplest APIs have
language-independent/machine-processed semantics has the benefit of clarifying
the proper default behavior of operations such as comparison, and allows us to
make [significant optimizations](#collation-semantics) that were previously
thought to conflict with Unicode.

#### Future Directions

One of the most common internationalization errors is the unintentional
presentation to users of text that has not been localized, but regularizing APIs
and improving documentation can go only so far in preventing this error.
Combined with the fact that `String` operations are non-localized by default,
the environment for processing human-readable text may still be somewhat
error-prone in Swift 4.

For an audience of mostly non-experts, it is especially important that naïve
code is very likely to be correct if it compiles, and that more sophisticated
issues can be revealed progressively. For this reason, we intend to
specifically and separately target localization and internationalization
problems in the Swift 5 timeframe.

### Operations With Options

There are three categories of common string operation that commonly need to be
tuned in various dimensions:

**Operation**|**Applicable Options**
---|---
sort ordering | locale, case/diacritic/width-insensitivity
case conversion | locale
pattern matching | locale, case/diacritic/width-insensitivity

The defaults for case-, diacritic-, and width-insensitivity are different for
localized operations than for non-localized operations, so for example a
localized sort should be case-insensitive by default, and a non-localized sort
should be case-sensitive by default. We propose a standard “language” of
defaulted parameters to be used for these purposes, with usage roughly like this:

 x.compared(to: y, case: .sensitive, in: swissGerman)

 x.lowercased(in: .currentLocale)

 x.allMatches(
   somePattern, case: .insensitive, diacritic: .insensitive)

This usage might be supported by code like this:

enum StringSensitivity {
case sensitive
case insensitive
}

extension Locale {
 static var currentLocale: Locale { ... }
}

extension Unicode {
 // An example of the option language in declaration context,
 // with nil defaults indicating unspecified, so defaults can be
 // driven by the presence/absence of a specific Locale
 func frobnicated(
   case caseSensitivity: StringSensitivity? = nil,
   diacritic diacriticSensitivity: StringSensitivity? = nil,
   width widthSensitivity: StringSensitivity? = nil,
   in locale: Locale? = nil
 ) -> Self { ... }
}

### Comparing and Hashing Strings

#### Collation Semantics

What Unicode says about collation—which is used in `<`, `==`, and hashing— turns
out to be quite interesting, once you pick it apart. The full Unicode Collation
Algorithm (UCA) works like this:

1. Fully normalize both strings
2. Convert each string to a sequence of numeric triples to form a collation key
3. “Flatten” the key by concatenating the sequence of first elements to the
  sequence of second elements to the sequence of third elements
4. Lexicographically compare the flattened keys

While step 1 can usually
be [done quickly](http://unicode.org/reports/tr15/#Description_Norm) and
incrementally, step 2 uses a collation table that maps matching *sequences* of
unicode scalars in the normalized string to *sequences* of triples, which get
accumulated into a collation key. Predictably, this is where the real costs
lie.

*However*, there are some bright spots to this story. First, as it turns out,
string sorting (localized or not) should be done down to what's called
the
[“identical” level](http://unicode.org/reports/tr10/#Multi_Level_Comparison),
which adds a step 3a: append the string's normalized form to the flattened
collation key. At first blush this just adds work, but consider what it does
for equality: two strings that normalize the same, naturally, will collate the
same. But also, *strings that normalize differently will always collate
differently*. In other words, for equality, it is sufficient to compare the
strings' normalized forms and see if they are the same. We can therefore
entirely skip the expensive part of collation for equality comparison.

Next, naturally, anything that applies to equality also applies to hashing: it
is sufficient to hash the string's normalized form, bypassing collation keys.
This should provide significant speedups over the current implementation.
Perhaps more importantly, since comparison down to the “identical” level applies
even to localized strings, it means that hashing and equality can be implemented
exactly the same way for localized and non-localized text, and hash tables with
localized keys will remain valid across current-locale changes.

Finally, once it is agreed that the *default* role for `String` is to handle
machine-generated and machine-readable text, the default ordering of `String`s
need no longer use the UCA at all. It is sufficient to order them in any way
that's consistent with equality, so `String` ordering can simply be a
lexicographical comparison of normalized forms,[4]
(which is equivalent to lexicographically comparing the sequences of grapheme
clusters), again bypassing step 2 and offering another speedup.

This leaves us executing the full UCA *only* for localized sorting, and ICU's
implementation has apparently been very well optimized.

Following this scheme everywhere would also allow us to make sorting behavior
consistent across platforms. Currently, we sort `String` according to the UCA,
except that—*only on Apple platforms*—pairs of ASCII characters are ordered by
unicode scalar value.

#### Syntax

Because the current `Comparable` protocol expresses all comparisons with binary
operators, string comparisons—which may require
additional [options](#operations-with-options)—do not fit smoothly into the
existing syntax. At the same time, we'd like to solve other problems with
comparison, as outlined
in
[this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
(implemented by changes at the head
of
[this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
We should adopt a modification of that proposal that uses a method rather than
an operator `<=>`:

enum SortOrder { case before, same, after }

protocol Comparable : Equatable {
func compared(to: Self) -> SortOrder
...
}

This change will give us a syntactic platform on which to implement methods with
additional, defaulted arguments, thereby unifying and regularizing comparison
across the library.

extension String {
func compared(to: Self) -> SortOrder

}

**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible
that the standard library simply adopts Foundation's `ComparisonResult` as is,
but we believe the community should at least consider alternate naming before
that happens. There will be an opportunity to discuss the choices in detail
when the modified
[Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes
up for review.

### `String` should be a `Collection` of `Character`s Again

In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
convinced ourselves that its semantics differed from those of `Collection` too
significantly.

It was always well understood that if strings were treated as sequences of
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`,
and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was
a collection of `Character` (extended grapheme clusters). During 2.0
development, though, we realized that correct string concatenation could
occasionally merge distinct grapheme clusters at the start and end of combined
strings.

This quirk aside, every aspect of strings-as-collections-of-graphemes appears to
comport perfectly with Unicode. We think the concatenation problem is tolerable,
because the cases where it occurs all represent partially-formed constructs. The
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE
ACCENT)—are explicitly called out in the Unicode standard as
“[degenerate](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)” or
“[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The other
cases—such as a string ending in a zero-width joiner or half of a regional
indicator—appear to be equally transient and unlikely outside of a text editor.

Admitting these cases encourages exploration of grapheme composition and is
consistent with what appears to be an overall Unicode philosophy that “no
special provisions are made to get marginally better behavior for… cases that
never occur in practice.”[2] Furthermore, it seems
unlikely to disturb the semantics of any plausible algorithms. We can handle
these cases by documenting them, explicitly stating that the elements of a
`String` are an emergent property based on Unicode rules.

The benefits of restoring `Collection` conformance are substantial:

* Collection-like operations encourage experimentation with strings to
   investigate and understand their behavior. This is useful for teaching new
   programmers, but also good for experienced programmers who want to
   understand more about strings/unicode.

* Extended grapheme clusters form a natural element boundary for Unicode
   strings. For example, searching and matching operations will always produce
   results that line up on grapheme cluster boundaries.

* Character-by-character processing is a legitimate thing to do in many real
   use-cases, including parsing, pattern matching, and language-specific
   transformations such as transliteration.

* `Collection` conformance makes a wide variety of powerful operations
   available that are appropriate to `String`'s default role as the vehicle for
   machine processed text.

   The methods `String` would inherit from `Collection`, where they parallel
   higher-level string algorithms, have the right semantics. For example,
   grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of
   `flatMap` with case-conversion produce the same results one would expect
   from whole-string ordering comparison, equality comparison, and
   case-conversion, respectively. `reverse` operates correctly on graphemes,
   keeping diacritics moored to their base characters and leaving emoji intact.
   Other methods such as `indexOf` and `contains` make obvious sense. A few
   `Collection` methods, like `min` and `max`, may not be particularly useful
   on `String`, but we don't consider that to be a problem worth solving, in
   the same way that we wouldn't try to suppress `min` and `max` on a
   `Set([UInt8])` that was used to store IP addresses.

* Many of the higher-level operations that we want to provide for `String`s,
   such as parsing and pattern matching, should apply to any `Collection`, and
   many of the benefits we want for `Collections`, such
   as unified slicing, should accrue
   equally to `String`. Making `String` part of the same protocol hierarchy
   allows us to write these operations once and not worry about keeping the
   benefits in sync.

* Slicing strings into substrings is a crucial part of the vocabulary of
   string processing, and all other sliceable things are `Collection`s.
   Because of its collection-like behavior, users naturally think of `String`
   in collection terms, but run into frustrating limitations where it fails to
   conform and are left to wonder where all the differences lie. Many simply
   “correct” this limitation by declaring a trivial conformance:

      extension String : BidirectionalCollection {}

   Even if we removed indexing-by-element from `String`, users could still do
   this:

     extension String : BidirectionalCollection {
       subscript(i: Index) -> Character { return characters[i] }
     }

   It would be much better to legitimize the conformance to `Collection` and
   simply document the oddity of any concatenation corner-cases, than to deny
   users the benefits on the grounds that a few cases are confusing.

Note that the fact that `String` is a collection of graphemes does *not* mean
that string operations will necessarily have to do grapheme boundary
recognition. See the Unicode protocol section for details.
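To make these benefits concrete, here is a short sketch of grapheme-wise collection operations as they behave once the conformance is restored (these run in later Swift versions, where `String` is again a `BidirectionalCollection` of `Character`):

```swift
let word = "Résumé"
// `reversed` keeps diacritics moored to their base characters
print(String(word.reversed()))             // "émuséR"
// equality is canonical: precomposed and combining forms compare equal
print("Re\u{301}sume\u{301}" == "Résumé")  // true
// searching lines up on grapheme-cluster boundaries
print(word.contains("é"))                  // true
```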

### `Character` and `CharacterSet`

`Character`, which represents a
Unicode
[extended grapheme cluster](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries),
is a bit of a black box, requiring conversion to `String` in order to
do any introspection, including interoperation with ASCII. To fix this, we should:

- Add a `unicodeScalars` view much like `String`'s, so that the sub-structure
  of grapheme clusters is discoverable.
- Add a failable `init` from sequences of scalars (returning nil for sequences
  that contain 0 or 2+ graphemes).
- (Lower priority) expose some operations, such as `func uppercase() ->
  String`, `var isASCII: Bool`, and, to the extent they can be sensibly
  generalized, queries of unicode properties that should also be exposed on
  `UnicodeScalar` such as `isAlphabetic` and `isGraphemeBase`.
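A `unicodeScalars` view on `Character` later shipped in roughly this shape; a sketch of the introspection it buys:

```swift
let single: Character = "\u{E9}"      // precomposed é: one scalar
let combined: Character = "e\u{301}"  // e + COMBINING ACUTE ACCENT: two scalars
// Same grapheme cluster, different scalar sub-structure:
print(single == combined)             // true (canonical equivalence)
print(single.unicodeScalars.count, combined.unicodeScalars.count)  // 1 2
```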

Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar`
type. This means it is usable on `String`, but only by going through the unicode
scalar view. To deal with this clash in the short term, `CharacterSet` should be
renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to
introduce a `CharacterSet` that provides similar functionality for extended
grapheme clusters.[5]

### Unification of Slicing Operations

Creating substrings is a basic part of String processing, but the slicing
operations that we have in Swift are inconsistent in both their spelling and
their naming:

* Slices with two explicit endpoints are done with subscript, and support
   in-place mutation:

       s[i..<j].mutate()

* Slicing from an index to the end, or from the start to an index, is done
   with a method and does not support in-place mutation:

       s.prefix(upTo: i).readOnly()

Prefix and suffix operations should be migrated to be subscripting operations
with one-sided ranges, i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as
in
[this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
With generic subscripting in the language, that will allow us to collapse a wide
variety of methods and subscript overloads into a single implementation, and
give users an easy-to-use and composable way to describe subranges.
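A sketch of the proposed spellings, which shipped in Swift 4 as one-sided ranges:

```swift
let s = "Hello, world"
if let comma = s.firstIndex(of: ",") {
    print(s[..<comma])   // "Hello" (replaces s.prefix(upTo: comma))
    print(s[comma...])   // ", world" (replaces s.suffix(from: comma))
}
```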

Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 5)`
is an ongoing research project that can be considered part of the potential
long-term vision of text (and collection) processing.

### Substrings

When implementing substring slicing, languages are faced with three options:

1. Make the substrings the same type as string, and share storage.
2. Make the substrings the same type as string, and copy storage when making the substring.
3. Make substrings a different type, with a storage copy on conversion to string.

We think number 3 is the best choice. A walk-through of the tradeoffs follows.

#### Same type, shared storage

In Swift 3.0, slicing a `String` produces a new `String` that is a view into a
subrange of the original `String`'s storage. This is why `String` is 3 words in
size (the start, length and buffer owner), unlike the similar `Array` type
which is only one.

This is a simple model with big efficiency gains when chopping up strings into
multiple smaller strings. But it does mean that a stored substring keeps the
entire original string buffer alive even after it would normally have been
released.

This arrangement has proven problematic in other programming languages,
because applications sometimes extract small strings from large ones and keep
those small strings long-term. That is effectively a memory leak, and it was
enough of a problem in Java that `substring` was changed from sharing storage
to making a copy in Java 1.7.

#### Same type, copied storage

Copying of substrings is also the choice made in C#, and in the default
`NSString` implementation. This approach avoids the memory leak issue, but has
obvious performance overhead in performing the copies.

This in turn encourages trafficking in string/range pairs instead of in
substrings, for performance reasons, leading to API challenges. For example:

```swift
foo.compare(bar, range: start..<end)
```

Here, it is not clear whether `range` applies to `foo` or `bar`. This
relationship is better expressed in Swift as a slicing operation:

```swift
foo[start..<end].compare(bar)
```

Not only does this clarify to which string the range applies, it also brings
this sub-range capability to any API that operates on `String` "for free". So
these other combinations also work equally well:

```swift
// apply range on argument rather than target
foo.compare(bar[start..<end])
// apply range on both
foo[start..<end].compare(bar[start1..<end1])
// compare two strings ignoring first character
foo.dropFirst().compare(bar.dropFirst())
```

In all three cases, an explicit range argument need not appear on the `compare`
method itself. The implementation of `compare` does not need to know anything
about ranges. Methods need only take range arguments when that was an
integral part of their purpose (for example, setting the start and end of a
user's current selection in a text box).

#### Different type, shared storage

The desire to share underlying storage while preventing accidental memory leaks
occurs with slices of `Array`. For this reason we have an `ArraySlice` type.
The inconvenience of a separate type is mitigated by most operations used on
`Array` from the standard library being generic over `Sequence` or `Collection`.

We should apply the same approach to `String` by introducing a distinct
`SubSequence` type, `Substring`. The advice given for `ArraySlice` would apply
equally to `Substring`:

> **Important:** Long-term storage of `Substring` instances is discouraged. A
> substring holds a reference to the entire storage of a larger string, not
> just to the portion it presents, even after the original string's lifetime
> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
> of large strings that are no longer otherwise accessible, which can appear
> to be memory leakage.

When assigning a `Substring` to a longer-lived variable (usually a stored
property) explicitly of type `String`, a type conversion will be performed, and
at this point the substring buffer is copied and the original string's storage
can be released.

A `String` that was not its own `Substring` could be one word (a single tagged
pointer) without requiring additional allocations. A `Substring` would be a
view onto a `String`, so would be 3 words: a pointer to the owner, a pointer to
the start, and a length. The small-string optimization for `Substring` would
take advantage of the larger size, probably with a less compressed encoding for
speed.

The downside of having two types is the inconvenience of sometimes having a
`Substring` when you need a `String`, and vice-versa. It is likely this would
be a significantly bigger problem than with `Array` and `ArraySlice`, as
slicing of `String` is such a common operation. It is especially relevant to
existing code that assumes `String` is the currency type. To ease the pain of
type mismatches, `Substring` should be a subtype of `String` in the same way
that `Int` is a subtype of `Optional<Int>`. This would give users an implicit
conversion from `Substring` to `String`, as well as the usual implicit
conversions such as `[Substring]` to `[String]` that other subtype
relationships receive.

In most cases, type inference combined with the subtype relationship should
make the type difference a non-issue and users will not care which type they
are using. For flexibility and optimizability, most operations from the
standard library will traffic in generic models of
[`Unicode`](#the--code-unicode--code--protocol).
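A sketch of the two types in action. Note that in the Swift that eventually shipped, the `Substring`-to-`String` conversion ended up explicit, via an initializer, rather than implicit as proposed here:

```swift
let line = "name: value"
let value = line.dropFirst(6)    // a Substring sharing line's storage
// Converting to String copies just the presented portion,
// releasing any claim on the rest of the original buffer.
let stored: String = String(value)
print(type(of: value), stored)   // Substring value
```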

##### Guidance for API Designers

In this model, **if a user is unsure about which type to use, `String` is always
a reasonable default**. A `Substring` passed where `String` is expected will be
implicitly copied. When compared to the “same type, copied storage” model, we
have effectively deferred the cost of copying from the point where a substring
is created until it must be converted to `String` for use with an API.

A user who needs to optimize away copies altogether should use this guideline:
if for performance reasons you are tempted to add a `Range` argument to your
method as well as a `String` to avoid unnecessary copies, you should instead
use `Substring`.
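A sketch of that guideline in practice: the function below takes a `Substring` instead of a `String` plus a `Range`, so callers can pass any slice without copying (`firstWord` is a hypothetical helper of our own, not a proposed API):

```swift
func firstWord(of s: Substring) -> Substring {
    if let space = s.firstIndex(of: " ") {
        return s[..<space]   // no copy: still a view into the caller's storage
    }
    return s
}

let text = "swift strings manifesto"
print(firstWord(of: text.dropFirst(6)))   // "strings"
```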

##### The “Empty Subscript”

To make it easy to call such an optimized API when you only have a `String` (or
to call any API that takes a `Collection`'s `SubSequence` when all you have is
the `Collection`), we propose the following “empty subscript” operation,

```swift
extension Collection {
  subscript() -> SubSequence {
    return self[startIndex..<endIndex]
  }
}
```

which allows the following usage:

```swift
funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring
```

The `[]` syntax can be offered as a fixit when needed, similar to `&` for an
`inout` argument. While it doesn't help a user convert `[String]` to
`[Substring]`, the need for such conversions is extremely rare, and they can be
done with a simple `map` (which could also be offered by a fixit):

```swift
takesAnArrayOfSubstring(arrayOfString.map { $0[] })
```

#### Other Options Considered

As we have seen, all three options above have downsides, but it's possible
these downsides could be eliminated/mitigated by the compiler. We are proposing
one such mitigation—implicit conversion—as part of the "different type,
shared storage" option, to help avoid the cognitive load on developers of
having to deal with a separate `Substring` type.

To avoid the memory leak issues of a "same type, shared storage" substring
option, we considered whether the compiler could perform an implicit copy of
the underlying storage when it detects the string is being "stored" for long
term usage, say when it is assigned to a stored property. The trouble with this
approach is it is very difficult for the compiler to distinguish between
long-term storage versus short-term in the case of abstractions that rely on
stored properties. For example, should the storing of a substring inside an
`Optional` be considered long-term? Or the storing of multiple substrings
inside an array? The latter would not work well in the case of a
`components(separatedBy:)` implementation that intended to return an array of
substrings. It would also be difficult to distinguish intentional medium-term
storage of substrings, say by a lexer. There does not appear to be an effective
consistent rule that could be applied in the general case for detecting when a
substring is truly being stored long-term.

To avoid the cost of copying substrings under "same type, copied storage", the
optimizer could be enhanced to reduce the impact of some of those copies.
For example, this code could be optimized to pull the invariant substring out
of the loop:

```swift
for _ in 0..<lots {
  someFunc(takingString: bigString[bigRange])
}
```

It's worth noting that a similar optimization is needed to avoid an equivalent
problem with implicit conversion in the "different type, shared storage" case:

```swift
let substring = bigString[bigRange]
for _ in 0..<lots { someFunc(takingString: substring) }
```

However, in the case of "same type, copied storage" there are many use cases
that cannot be optimized as easily. Consider the following simple definition
of a recursive `contains` algorithm: when substring slicing costs linear time,
the overall algorithm becomes quadratic:

```swift
extension String {
    func containsChar(_ x: Character) -> Bool {
        return !isEmpty && (first == x || dropFirst().containsChar(x))
    }
}
```

It is unrealistic to expect the optimizer to eliminate this problem, so the
user is forced to rewrite the code without string slicing to make it efficient
(assuming they remember to):

```swift
extension String {
    // add optional argument tracking progress through the string
    func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
        let idx = idx ?? startIndex
        return idx != endIndex
            && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
    }
}
```

#### Substrings, Ranges and Objective-C Interop

The pattern of passing a string/range pair is common in several Objective-C
APIs, and is made especially awkward in Swift by the non-interchangeability of
`Range<String.Index>` and `NSRange`.

```swift
s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))
```

In general, however, the Swift idiom for operating on a sub-range of a
`Collection` is to *slice* the collection and operate on that:

```swift
s2.find(s2[j..<s2.endIndex])
```

Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported
without the `NSRange` argument. The Objective-C importer should be changed to
give these APIs special treatment so that when a `Substring` is passed, instead
of being converted to a `String`, the full `NSString` and range are passed to
the Objective-C method, thereby avoiding a copy.

As a result, you would never need to pass an `NSRange` to these APIs, which
solves the impedance problem by eliminating the argument, resulting in more
idiomatic Swift code while retaining the performance benefit. To help users
manually handle any cases that remain, Foundation should be augmented to allow
the following syntax for converting to and from `NSRange`:

```swift
let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
let iToJ = Range(nsr, in: s)    // Equivalent to i..<j
```
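These conversions shipped in essentially this shape in Swift 4's Foundation overlay; a quick check of the round trip (note that `NSRange` counts UTF-16 code units):

```swift
import Foundation

let s = "café au lait"
let r = s.range(of: "café")!   // Range<String.Index>
let nsr = NSRange(r, in: s)    // location 0, length 4 in UTF-16 units
let back = Range(nsr, in: s)!  // round-trips to the same Range
print(nsr.location, nsr.length, s[back])  // 0 4 café
```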

### The `Unicode` protocol

With `Substring` and `String` being distinct types and sharing almost all
interface and semantics, and with the highest-performance string processing
requiring knowledge of encoding and layout that the currency types can't
provide, it becomes important to capture the common “string API” in a protocol.
Since Unicode conformance is a key feature of string processing in Swift, we
call that protocol `Unicode`:

**Note:** The following assumes several features that are planned but not yet implemented in
Swift, and should be considered a sketch rather than a final design.

```swift
protocol Unicode
  : Comparable, BidirectionalCollection where Element == Character {

  associatedtype Encoding : UnicodeEncoding
  var encoding: Encoding { get }

  associatedtype CodeUnits
    : RandomAccessCollection where Element == Encoding.CodeUnit
  var codeUnits: CodeUnits { get }

  associatedtype UnicodeScalars
    : BidirectionalCollection where Element == UnicodeScalar
  var unicodeScalars: UnicodeScalars { get }

  associatedtype ExtendedASCII
    : BidirectionalCollection where Element == UInt32
  var extendedASCII: ExtendedASCII { get }
}
```

```swift
extension Unicode {
  // ... define high-level non-mutating string operations, e.g. search ...

  func compared<Other: Unicode>(
    to rhs: Other,
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> SortOrder { ... }
}
```

```swift
extension Unicode : RangeReplaceableCollection where CodeUnits :
  RangeReplaceableCollection {
  // Satisfy protocol requirement
  mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
    where C.Element == Element

  // ... define high-level mutating string operations, e.g. replace ...
}
```

The goal is that `Unicode` exposes the underlying encoding and code units in
such a way that for types with a known representation (e.g. a high-performance
`UTF8String`) that information can be known at compile-time and can be used to
generate a single path, while still allowing types like `String` that admit
multiple representations to use runtime queries and branches to fast path
specializations.

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.

### Scanning, Matching, and Tokenization

#### Low-Level Textual Analysis

We should provide convenient APIs for processing strings by character. For example,
it should be easy to cleanly express, “if this string starts with `"f"`, process
the rest of the string as follows…” Swift is well-suited to expressing this
common pattern beautifully, but we need to add the APIs. Here are two examples
of the sort of code that might be possible given such APIs:

```swift
if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
  somethingWith(input) // process the rest of input
}

if let (number, restOfInput) = input.parsingPrefix(Int.self) {
  ...
}
```

The specific spelling and functionality of APIs like this are TBD. The larger
point is to make sure matching-and-consuming jobs are well-supported.
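As a rough sketch of how a `droppingPrefix`-style API might behave, here is a hypothetical mutating variant built from today's standard library (the name, shape, and predicate spelling are ours, not a settled design):

```swift
extension Substring {
    /// If the first character satisfies `predicate`, consumes and returns it.
    mutating func droppingPrefix(_ predicate: (Character) -> Bool) -> Character? {
        guard let c = first, predicate(c) else { return nil }
        self = dropFirst()   // advance past the matched character
        return c
    }
}

var input = Substring("f123")
if let letter = input.droppingPrefix({ $0 >= "a" && $0 <= "z" }) {
    print(letter, input)  // f 123
}
```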

#### Unified Pattern Matcher Protocol

Many of the current methods that do matching are overloaded to do the same
logical operations in different ways, with the following axes:

- Logical Operation: `find`, `split`, `replace`, match at start
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
the method name, and sometimes an argument
- Whole string or subrange.

We should represent these aspects as orthogonal, composable components,
abstracting pattern matchers into a protocol like
[this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
that can allow us to define logical operations once, without introducing
overloads, and massively reducing API surface area.

For example, using the strawman prefix `%` syntax to turn string literals into
patterns, the following pairs would all invoke the same generic methods:

```swift
if let found = s.firstMatch(%"searchString") { ... }
if let found = s.firstMatch(someRegex) { ... }

for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
for m in s.allMatches(someRegex) { ... }

let items = s.split(separatedBy: ", ")
let tokens = s.split(separatedBy: CharacterSet.whitespace)
```

Note that, because Swift requires the indices of a slice to match the indices of
the range from which it was sliced, operations like `firstMatch` can return a
`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in
the string being searched, if needed, can easily be recovered as the
`startIndex` and `endIndex` of the `Substring`.
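For example, since a `Substring`'s indices are positions in its base string, a match can stand in for a range (a small demonstration using today's slicing):

```swift
let s = "one two three"
let match = s.dropFirst(4).prefix(3)          // "two", as a Substring
// The match's own indices address the base string directly:
print(s[match.startIndex..<match.endIndex])   // "two"
```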

Note also that matching operations are useful for collections in general, and
would fall out of this proposal:

```swift
// replace subsequences of contiguous NaNs with zero
forces.replace(oneOrMore([Float.nan]), [0.0])
```

#### Regular Expressions

Addressing regular expressions is out of scope for this proposal.
That said, it is important to note that the pattern-matching protocol mentioned
above provides a suitable foundation for regular expressions, and types such as
`NSRegularExpression` can easily be retrofitted to conform to it. In the
future, support for regular expression literals in the compiler could allow for
compile-time syntax checking and optimization.

### String Indices

`String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and
`utf16`—each with its own opaque index type. The APIs used to translate indices
between views add needless complexity, and the opacity of indices makes them
difficult to serialize.

The index translation problem has two aspects:

1. `String` views cannot consume one another's indices without a cumbersome
   conversion step. An index into a `String`'s `characters` must be translated
   before it can be used as a position in its `unicodeScalars`. Although these
   translations are rarely needed, they add conceptual and API complexity.
2. Many APIs in the core libraries and other frameworks still expose `String`
   positions as `Int`s and regions as `NSRange`s, which can only reference a
   `utf16` view and interoperate poorly with `String` itself.

#### Index Interchange Among Views

String's need for flexible backing storage and reasonably-efficient indexing
(i.e. without dynamically allocating and reference-counting the indices
themselves) means indices need an efficient underlying storage type. Although
we do not wish to expose `String`'s indices *as* integers, `Int` offsets into
the underlying code unit storage make a good underlying storage type, provided
`String`'s underlying storage supports random access. We think random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Making these `Int` code unit offsets conveniently accessible and constructible
solves the serialization problem:

```swift
clipboard.write(s.endIndex.codeUnitOffset)
let offset = clipboard.read(Int.self)
let i = String.Index(codeUnitOffset: offset)
```

Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
seamless by having them share an index type (semantics of indexing a `String`
between grapheme cluster boundaries are TBD—it can either trap or be forgiving).
Having a common index allows easy traversal into the interior of graphemes,
something that is often needed, without making it likely that someone will do it
by accident.

- `String.index(after:)` should advance to the next grapheme, even when the
  index points partway through a grapheme.

- `String.index(before:)` should move to the start of the grapheme before
  the current position.

Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not
crucial, as the specifics of encoding should not be a concern for most use
cases, and would impose needless costs on the indices of other views. That
said, we can make translation much more straightforward by exposing simple
bidirectional converting `init`s on both index types:

```swift
let u8Position = String.UTF8.Index(someStringIndex)
let originalPosition = String.Index(u8Position)
```

#### Index Interchange with Cocoa

We intend to address `NSRange`s that denote substrings in Cocoa APIs as
described [later in this document](#substrings--ranges-and-objective-c-interop).
That leaves the interchange of bare indices with Cocoa APIs trafficking in
`Int`. Hopefully such APIs will be rare, but when needed, the following
extension, which would be useful for all `Collections`, can help:

```swift
extension Collection {
  func index(offset: IndexDistance) -> Index {
    return index(startIndex, offsetBy: offset)
  }
  func offset(of i: Index) -> IndexDistance {
    return distance(from: startIndex, to: i)
  }
}
```

Then integers can easily be translated into offsets into a `String`'s `utf16`
view for consumption by Cocoa:

```swift
let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
let swiftIndex = s.utf16.index(offset: cocoaIndex)
```

### Formatting

A full treatment of formatting is out of scope of this proposal, but
we believe it's crucial for completing the text processing picture. This
section details some of the existing issues and thinking that may guide future
development.

#### Printf-Style Formatting

`String.format` is designed on the `printf` model: it takes a format string with
textual placeholders for substitution, and an arbitrary list of other arguments.
The syntax and meaning of these placeholders has a long history in
C, but for anyone who doesn't use them regularly they are cryptic and complex,
as the `printf (3)` man page attests.

Aside from complexity, this style of API has two major problems: First, the
spelling of these placeholders must match up to the types of the arguments, in
the right order, or the behavior is undefined. Some limited support for
compile-time checking of this correspondence could be implemented, but only for
the cases where the format string is a literal. Second, there's no reasonable
way to extend the formatting vocabulary to cover the needs of new types: you are
stuck with what's in the box.

#### Foundation Formatters

The formatters supplied by Foundation are highly capable and versatile, offering
both formatting and parsing services. When used for formatting, though, the
design pattern demands more from users than it should:

* Matching the type of data being formatted to a formatter type
* Creating an instance of that type
* Setting stateful options (`currency`, `dateStyle`) on the type. Note: the
   need for this step prevents the instance from being used and discarded in
   the same expression where it is created.
* Overall, introduction of needless verbosity into source

These may seem like small issues, but the experience of Apple localization
experts is that the total drag of these factors on programmers is such that they
tend to reach for `String.format` instead.

#### String Interpolation

Swift string interpolation provides a user-friendly alternative to printf's
domain-specific language (just write ordinary swift code!) and its type safety
problems (put the data right where it belongs!) but the following issues prevent
it from being useful for localized formatting (among other jobs):

* [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
   types used in string interpolation.
* [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
   distinguish (fragments of) the base string from the string substitutions.

In the long run, we should improve Swift string interpolation to the point where
it can participate in most any formatting job. Mostly this centers around
fixing the interpolation protocols per the previous item, and supporting
localization.

To be able to use formatting effectively inside interpolations, it needs to be
both lightweight (because it all happens in-situ) and discoverable. One
approach would be to standardize on `format` methods, e.g.:

```swift
"Column 1: \(n.format(radix:16, width:8)) *** \(message)"

"Something with leading zeroes: \(x.format(fill: zero, width:8))"
```

### C String Interop

Our support for interoperation with nul-terminated C strings is scattered and
incoherent, with six ways to transform a C string into a `String` and four
ways to do the inverse. These APIs should be replaced with the following:

```swift
extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  ///   bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  ///   the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}
```

In both of the construction APIs, any invalid encoding sequence detected will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per the Unicode specification. This covers the common case. The
replacement is done *physically* in the underlying storage, and the validity
of the result is recorded in the `String`'s `encoding` so that future accesses
need not be slowed down by separate error repair.

Construction that is aborted when encoding errors are detected can be
accomplished using APIs on the `encoding`. String types that retain their
physical encoding even in the presence of errors and are repaired on-the-fly can
be built as different instances of the `Unicode` protocol.
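The simple UTF-8 construction API and `withCString` shipped essentially in this shape; a round trip looks like:

```swift
let original = "héllo, wörld"
let roundTripped = original.withCString { cstr in
    String(cString: cstr)   // decode the NUL-terminated UTF-8 bytes back into a String
}
print(roundTripped == original)   // true
```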

### Unicode 9 Conformance

Unicode 9 (and macOS 10.11) brought us support for family emoji, which changes
the process of properly identifying `Character` boundaries. We need to update
`String` to account for this change.
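For example, the family emoji is three person emoji joined by U+200D (ZERO WIDTH JOINER); under Unicode 9 grapheme-breaking rules the whole sequence is a single grapheme cluster. A sketch, assuming a toolchain with the updated rules (and the later model in which `String` is itself a collection of `Character`):

```swift
// 👨‍👩‍👦: MAN + ZWJ + WOMAN + ZWJ + BOY — five Unicode scalars.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F466}"
let scalarCount = family.unicodeScalars.count   // 5
// With Unicode 9 grapheme rules, the joined sequence is one
// grapheme cluster, hence a single Character.
let characterCount = family.count               // 1 under updated rules
```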

### High-Performance String Processing

Many strings are short enough to store in 64 bits, many can be stored using only
8 bits per unicode scalar, others are best encoded in UTF-16, and some come to
us already in some other encoding, such as UTF-8, that would be costly to
translate. Supporting these formats while maintaining usability for
general-purpose APIs demands that a single `String` type can be backed by many
different representations.

That said, the highest performance code always requires static knowledge of the
data structures on which it operates, and for this code, dynamic selection of
representation comes at too high a cost. Heavy-duty text processing demands a
way to opt out of dynamism and directly use known encodings. Having this
ability can also make it easy to cleanly specialize code that handles dynamic
cases for maximal efficiency on the most common representations.

To address this need, we can build models of the `Unicode` protocol that encode
representation information into the type, such as `NFCNormalizedUTF16String`.

### Parsing ASCII Structure

Although many machine-readable formats support the inclusion of arbitrary
Unicode text, it is also common that their fundamental structure lies entirely
within the ASCII subset (JSON, YAML, many XML formats). These formats are often
processed most efficiently by recognizing ASCII structural elements as ASCII,
and capturing the arbitrary sections between them in more-general strings. The
current String API offers no way to efficiently recognize ASCII and skip past
everything else without the overhead of full decoding into unicode scalars.

For these purposes, strings should supply an `extendedASCII` view that is a
collection of `UInt32`, where values less than `0x80` represent the
corresponding ASCII character, and other values represent data that is specific
to the underlying encoding of the string.
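A hypothetical sketch of the kind of scan this enables, using the existing `utf8` view as a stand-in for the proposed `extendedASCII` view: structural elements are matched as ASCII values, and all other code units are skipped without any decoding.

```swift
let json = "{\"greeting\": [\"héllo\", \"wörld\"]}"
var depth = 0
for unit in json.utf8 {
    // Values below 0x80 are ASCII characters themselves; bytes of
    // multi-byte sequences are >= 0x80 and never match these cases.
    switch unit {
    case UInt8(ascii: "{"), UInt8(ascii: "["): depth += 1
    case UInt8(ascii: "}"), UInt8(ascii: "]"): depth -= 1
    default: break
    }
}
// depth == 0 for balanced input
```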

## Language Support

This proposal depends on two new features in the Swift language:

1. **Generic subscripts**, to
  enable unified slicing syntax.

2. **A subtype relationship** between
  `Substring` and `String`, enabling framework APIs to traffic solely in
  `String` while still making it possible to avoid copies by handling
  `Substring`s where necessary.

Additionally, **the ability to nest types and protocols inside
protocols** could significantly shrink the footprint of this proposal
on the top-level Swift namespace.

## Open Questions

### Must `String` be limited to storing UTF-16 subset encodings?

- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
question here; this is about what encodings must be storable, without
transcoding, in the common currency type called “`String`”.
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
- If we have a way to get at a `String`'s code units, we need a concrete type in
which to express them in the API of `String`, which is a concrete type.
- If `String` needs to be able to represent UTF-32, presumably the code units need
to be `UInt32`.
- Not supporting UTF-32-encoded text seems like one reasonable design choice.
- Maybe we can allow UTF-8 storage in `String` and expose its code units as
`UInt16`, just as we would for Latin-1.
- Supporting only UTF-16-subset encodings would imply that `String` indices can
be serialized without recording the `String`'s underlying encoding.
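The "UTF-16 subset" property can be seen concretely: a Latin-1 code unit's value *is* its Unicode scalar value, so widening each byte to `UInt16` yields valid UTF-16 with no table lookups. A sketch using the current `String(decoding:as:)` initializer:

```swift
// "Café" in Latin-1: the byte 0xE9 is é, which is also U+00E9.
let latin1: [UInt8] = [0x43, 0x61, 0x66, 0xE9]
// Widening each byte is all it takes to view Latin-1 as UTF-16.
let widened = latin1.map { UInt16($0) }
let s = String(decoding: widened, as: UTF16.self)
// s == "Café"
```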

### Do we need a type-erasable base protocol for UnicodeEncoding?

UnicodeEncoding has an associated type, but it may be important to be able to
traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”

### Should there be a string “facade?”

One possible design alternative makes `Unicode` a vehicle for expressing
the storage and encoding of code units, but does not attempt to give it an API
appropriate for `String`. Instead, string APIs would be provided by a generic
wrapper around an instance of `Unicode`:

struct StringFacade<U: Unicode> : BidirectionalCollection {

 // ...APIs for high-level string processing here...

 var unicode: U // access to lower-level unicode details
}

typealias String = StringFacade<StringStorage>
typealias Substring = StringFacade<StringStorage.SubSequence>

This design would allow us to de-emphasize lower-level `String` APIs such as
access to the specific encoding, by putting them behind a `.unicode` property.
A similar effect in a facade-less design would require a new top-level
`StringProtocol` playing the role of the facade with an `associatedtype
Storage : Unicode`.

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

struct String<U: Unicode = StringStorage> 
 : BidirectionalCollection {

 // ...APIs for high-level string processing here...

 var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>

One advantage of such a design is that naïve users will always extend “the right
type” (`String`) without thinking, and the new APIs will show up on `Substring`,
`MyUTF8String`, etc. That said, it also has downsides that should not be
overlooked, not least of which is the confusability of the meaning of the word
“string.” Is it referring to the generic or the concrete type?

### `TextOutputStream` and `TextOutputStreamable`

`TextOutputStreamable` is intended to provide a vehicle for
efficiently transporting formatted representations to an output stream
without forcing the allocation of storage. Its use of `String`, a
type with multiple representations, at the lowest-level unit of
communication, conflicts with this goal. It might be sufficient to
change `TextOutputStream` and `TextOutputStreamable` to traffic in an
associated type conforming to `Unicode`, but that is not yet clear.
This area will require some design work.

### `description` and `debugDescription`

* Should these be creating localized or non-localized representations?
* Is returning a `String` efficient enough?
* Is `debugDescription` pulling the weight of the API surface area it adds?

### `StaticString`

`StaticString` was added as a byproduct of standard library development and kept
around because it seemed useful, but it was never truly *designed* for client
programmers. We need to decide what happens with it. Presumably *something*
should fill its role, and that should conform to `Unicode`.

## Footnotes

<b id="f0">0</b> The integers rewrite currently underway is expected to
   substantially reduce the scope of `Int`'s API by using more
   generics. [:leftwards_arrow_with_hook:](#a0)

<b id="f1">1</b> In practice, these semantics will usually be tied to the
version of the installed [ICU](http://icu-project.org) library, which
programmatically encodes the most complex rules of the Unicode Standard and its
de-facto extension, CLDR.[:leftwards_arrow_with_hook:](#a1)

<b id="f2">2</b>
See
[http://unicode.org/reports/tr29/#Notation](http://unicode.org/reports/tr29/#Notation). Note
that inserting Unicode scalar values to prevent merging of grapheme clusters would
also constitute a kind of misbehavior (one of the clusters at the boundary would
not be found in the result), so would be relatively costly to implement, with
little benefit. [:leftwards_arrow_with_hook:](#a2)

<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by
the Unicode standard for this purpose. In fact there's
a [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
dedicated to it. In particular, §5.17 says:

When comparing text that is visible to end users, a correct linguistic sort
should be used, as described in _Section 5.16, Sorting and
Searching_. However, in many circumstances the only requirement is for a
fast, well-defined ordering. In such cases, a binary ordering can be used.

[:leftwards_arrow_with_hook:](#a4)

<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto
properties in a table that's indexed by unicode scalar value. This table is
part of the Unicode standard. Some of these queries (e.g., “is this an
uppercase character?”) may have fairly obvious generalizations to grapheme
clusters, but exactly how to do it is a research topic and *ideally* we'd either
establish the existing practice that the Unicode committee would standardize, or
the Unicode committee would do the research and we'd implement their
result.[:leftwards_arrow_with_hook:](#a5)

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Ole Begemann) #12

The downside of having two types is the inconvenience of sometimes having a
`Substring` when you need a `String`, and vice-versa. It is likely this would
be a significantly bigger problem than with `Array` and `ArraySlice`, as
slicing of `String` is such a common operation. It is especially relevant to
existing code that assumes `String` is the currency type.

I'm not familiar with the term "currency type" that appears several
times in the document. Could you clarify what it means? Googling it
proved difficult because all results are about the "money" meaning of
"currency".


(Jordan Rose) #13

I want to start by saying great work, Ben and Dave. I know you've put a lot of time into this (and humored me in several Apple-internal discussions) and what's here looks like a great overhaul of String, balancing several tricky constraints. I do want to record some comments on specific parts of the proposal that I still have concerns about, but as usual you can of course take these with a grain of salt.

To ease the pain of type mismatches, Substring should be a subtype of String in the same way that Int is a subtype of Optional<Int>. This would give users an implicit conversion from Substring to String, as well as the usual implicit conversions such as [Substring] to [String] that other subtype relationships receive.

I'm concerned about this for two reasons: first, because even with the comparison with Optional boxing, this is basically adding arbitrary conversions to the language, which we took out in part because of their disastrous effect on the performance of the type checker; and second, because this one in particular makes an O(N) copy operation very subtle (though admittedly one you incur all the time today, with no opt-out). A possible mitigation for the first issue would be to restrict the implicit conversion to arguments, like we do for inout-to-pointer conversions. It's still putting implicit conversions back into the language, though.

Therefore, APIs that operate on an NSString/NSRange pair should be imported without the NSRange argument. The Objective-C importer should be changed to give these APIs special treatment so that when a Substring is passed, instead of being converted to a String, the full NSString and range are passed to the Objective-C method, thereby avoiding a copy.

I'm very skeptical about assuming that a method that takes an NSString and an NSRange automatically means to apply that NSRange to the NSString, but fortunately it may not be much of an issue in practice. A quick grep of Foundation and AppKit turned up only 45 methods that took both an NSRange and an NSString *, clustered on a small number of classes; in less than half of these cases would the transformation to Substring actually be valid (the other ranges refer to some data in the receiver rather than the argument). I've attached these results below, if you're interested.

(Note that I left out methods on NSString itself, since those are manually bridged to String, but there aren't actually too many of those either. "Foundation and AppKit" also isn't exhaustive, of course.)

  associatedtype ExtendedASCII
    : BidirectionalCollection where Element == UInt32
  var extendedASCII: ExtendedASCII { get }

This isn't a criticism, just a question: why constrain the collection to UInt32 elements? It seems unfortunate that the most common buffer types (UTF-16, UTF-8, or "unparsed bytes") can't just be passed as-is.

  var unicodeScalars: UnicodeScalars { get }

Typo: this appears twice in the Unicode protocol.

We should represent these aspects as orthogonal, composable components, abstracting pattern matchers into a protocol like this one, that can allow us to define logical operations once, without introducing overloads, and massively reducing API surface area.

I'm still uneasy about the performance of generalized matching operations built on top of Collection. I'm not sure we can reasonably expect the compiler to lower that all down to bulk memory accesses. That's at least only one part of the manifesto, though.

clipboard.write(s.endIndex.codeUnitOffset)
let offset = clipboard.read(Int.self)
let i = String.Index(codeUnitOffset: offset)

Sorry, what is 'clipboard'? I think I'm missing something in this section—it's talking about how it's important to have a stable representation for string positions across the different index types, but the code sample doesn't directly connect for me.

Our support for interoperation with nul-terminated C strings is scattered and incoherent, with 6 ways to transform a C string into a String and four ways to do the inverse. These APIs should be replaced with the following

This proposal doesn't plan to remove the implicit, scoped, argument-only String-to-UnsafePointer<CChar> conversion, does it? (Is that one of the 6?)

To address this need, we can build models of the Unicode protocol that encode representation information into the type, such as NFCNormalizedUTF16String.

Not an urgent thought, but I wonder if these alternate representations really belong in the stdlib, as opposed to some auxiliary library like "SwiftStrings" or "CoreStrings", and if so whether that's still part of the standard Swift distribution or just a plain old SwiftPM package that happens to be maintained by Apple (and maybe comes preincluded with Xcode for now).

That's all I have. Again, great work, and godspeed.

Jordan

(unfortunately I am not in charge of implementing any of the features you need for this, at least as far as I know)

NSRange.txt (9.58 KB)

···

On Jan 19, 2017, at 18:56, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

Hi all,

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

We’re eager to hear everyone’s thoughts.

Regards,
Ben and Dave


(Karl) #14

Very nice improvements overall!

To ease the pain of type mismatches, Substring should be a subtype of String in the same way that Int is a subtype of Optional<Int>. This would give users an implicit conversion from Substring to String, as well as the usual implicit conversions such as [Substring] to [String] that other subtype relationships receive.

As others have said, it would be nice for this to be more general. Perhaps we can have a special type or protocol, something like RecursiveSlice?

A Substring passed where String is expected will be implicitly copied. When compared to the “same type, copied storage” model, we have effectively deferred the cost of copying from the point where a substring is created until it must be converted to String for use with an API.

Could noescape parameters/new memory model with borrowing make this more general? Again it seems very useful for all kinds of Collections.

The “Empty Subscript”

Empty subscript seems weird. IMO, it’s because of the asymmetry between subscripts and computed properties. I would favour a model which unifies computed properties and subscripts (e.g. computed properties could return “addressors” for in-place mutation).
Maybe this could be an “entireCollection”/“entireSlice” computed property?

The goal is that Unicode exposes the underlying encoding and code units in such a way that for types with a known representation (e.g. a high-performance UTF8String) that information can be known at compile-time and can be used to generate a single path, while still allowing types like String that admit multiple representations to use runtime queries and branches to fast path specializations.

Typo: “unicodeScalars” is in the protocol twice.

If I understand it, CodeUnits is the thing which should always be defined by conformers to Unicode, and UnicodeScalars and ExtendedASCII could have default implementations (for example, UTF8String/UTF16String/3rd party conformers will use those), and String might decide to return its native buffer (e.g. if Encoding.CodeUnit == UnicodeScalar).

I’m just wondering how difficult it would be for a 3rd-party type to conform to Unicode. If you’re developing a text editor, for example, it’s possible that you may need to implement your own String-like type with some optimised storage model and it would be nice to be able to use generic algorithms with them. I’m thinking that you will have some kind of backing buffer, and you will want to expose regions of that to clients as Strings so that they can render them for UI or search through them, etc, without introducing a copy just for the semantic understanding that this data region contains some text content.

I’ll need to examine the generic String idea more, but it’s certainly very interesting...

Indexes

One thing which I think is critical is the ability to advance an index by a given number of codeUnits. I was writing some code which interfaced with the Cocoa NSTextStorage class, tagging parts of a string that a user was editing. If this was an Array, when the user inserts some elements before your stored indexes, those indexes become invalid but you can easily advance by the difference to efficiently have your indexes pointing to the same characters.

Currently, that’s impossible with String. If the user inserts a string at a given index, your old indexes may not even point to the start of a grapheme cluster any more, and advancing the index is needlessly costly. For example:

var characters = "This is a test".characters
assert(characters.count == 14)

// Store an index to something.
let endBeforePrepending = characters.endIndex

// Insert some characters somewhere.
let insertedCharacters = "[PREPENDED]".characters
assert(insertedCharacters.count == 11)
characters.replaceSubrange(characters.startIndex..<characters.startIndex, with: insertedCharacters)

// This isn’t really correct.
let endAfterPrepending = characters.index(endBeforePrepending, offsetBy: insertedCharacters.count)
assert(endAfterPrepending == characters.endIndex) // Fails Anyway. 24 != 25

The manifesto is correct to emphasise machine processing of Strings, but it should also ensure that machine processing of mutable Strings is efficient. That way we can tag backing-Strings inside user-interface components and maintain those indices in a unicode-safe way.

The way to solve this would be that, when replacing or removing a portion of a String, you learn how many CodeUnits in the receiver’s encoding were inserted/removed so you can shift your indexes accordingly.
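Something like the code-unit shifting described here can be approximated today by dropping down to the `utf16` view and working with offsets rather than indices. A sketch (the point being that the offset arithmetic itself is cheap, with no grapheme-boundary walking):

```swift
var text = "This is a test"
// Remember the end of the original text as a UTF-16 code-unit offset.
let savedOffset = text.utf16.count              // 14
let prefix = "[PREPENDED]"
text.insert(contentsOf: prefix, at: text.startIndex)
// Shifting by the number of inserted code units keeps the offset
// pointing at the same logical position.
let shiftedOffset = savedOffset + prefix.utf16.count
// shiftedOffset == text.utf16.count == 25
```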


(Brent Royal-Gordon) #15

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

We’re eager to hear everyone’s thoughts.

There is so, so much good stuff here. I'm really looking forward to seeing how these ideas develop and enter the language.

#### Future Directions

One of the most common internationalization errors is the unintentional
presentation to users of text that has not been localized, but regularizing APIs
and improving documentation can go only so far in preventing this error.
Combined with the fact that `String` operations are non-localized by default,
the environment for processing human-readable text may still be somewhat
error-prone in Swift 4.

For an audience of mostly non-experts, it is especially important that naïve
code is very likely to be correct if it compiles, and that more sophisticated
issues can be revealed progressively. For this reason, we intend to
specifically and separately target localization and internationalization
problems in the Swift 5 timeframe.

I am very glad to see this statement in a Swift design document. I have a few ideas about this, but they can wait until the next version.

At first blush this just adds work, but consider what it does
for equality: two strings that normalize the same, naturally, will collate the
same. But also, *strings that normalize differently will always collate
differently*. In other words, for equality, it is sufficient to compare the
strings' normalized forms and see if they are the same. We can therefore
entirely skip the expensive part of collation for equality comparison.

Next, naturally, anything that applies to equality also applies to hashing: it
is sufficient to hash the string's normalized form, bypassing collation keys.

That's a great catch.
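The point can be demonstrated with Foundation's existing canonical-mapping API: two canonically equivalent spellings have identical scalars after NFC normalization, so a cheap code-unit comparison (and hence hashing) of the normalized forms agrees with canonical equivalence. A sketch:

```swift
import Foundation

let composed = "\u{00E9}"       // é as a single precomposed scalar
let decomposed = "e\u{0301}"    // e followed by COMBINING ACUTE ACCENT
// After NFC normalization both spellings carry the same scalars,
// so equality reduces to a code-unit comparison.
let a = composed.precomposedStringWithCanonicalMapping
let b = decomposed.precomposedStringWithCanonicalMapping
// a.unicodeScalars.elementsEqual(b.unicodeScalars) == true
```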

This leaves us executing the full UCA *only* for localized sorting, and ICU's
implementation has apparently been very well optimized.

Sounds good to me.

Because the current `Comparable` protocol expresses all comparisons with binary
operators, string comparisons—which may require
additional [options](#operations-with-options)—do not fit smoothly into the
existing syntax. At the same time, we'd like to solve other problems with
comparison, as outlined
in
[this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
(implemented by changes at the head
of
[this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
We should adopt a modification of that proposal that uses a method rather than
an operator `<=>`:

enum SortOrder { case before, same, after }

protocol Comparable : Equatable {
func compared(to: Self) -> SortOrder
...
}

This change will give us a syntactic platform on which to implement methods with
additional, defaulted arguments, thereby unifying and regularizing comparison
across the library.

extension String {
func compared(to: Self) -> SortOrder

}

While it's great that `compared(to:case:etc.)` is parallel to `compared(to:)`, you don't actually want to *use* anything like `compared(to:)` if you can help it. Think about the clarity at the use site:

  if foo.compared(to: bar, case: .insensitive, locale: .current) == .before { … }

The operands and sense of the comparison are kind of lost in all this garbage. You really want to see `foo < bar` in this code somewhere, but you don't.

I'm struggling a little with the naming and syntax, but as a general approach, I think we want people to use something more like this:

  if StringOptions(case: .insensitive, locale: .current).compare(foo < bar) { … }

Which might have an implementation like:

  // This protocol might actually be part of your `Unicode` protocol; I'm just breaking it out separately here.
  protocol StringOptionsComparable {
    func compare(to: Self, options: StringOptions) -> SortOrder
  }
  extension StringOptionsComparable {
    static func < (lhs: Self, rhs: Self) -> (lhs: Self, rhs: Self, op: (SortOrder) -> Bool) {
      return (lhs, rhs, { $0 == .before })
    }
    static func == (lhs: Self, rhs: Self) -> (lhs: Self, rhs: Self, op: (SortOrder) -> Bool) {
      return (lhs, rhs, { $0 == .same })
    }
    static func > (lhs: Self, rhs: Self) -> (lhs: Self, rhs: Self, op: (SortOrder) -> Bool) {
      return (lhs, rhs, { $0 == .after })
    }
    // etc.
  }
  
  struct StringOptions {
    // Obvious properties and initializers go here
    
    func compare<StringType: StringOptionsComparable>(_ expression: (lhs: StringType, rhs: StringType, op: (SortOrder) -> Bool)) -> Bool {
      return expression.op( expression.lhs.compare(to: expression.rhs, options: self) )
    }
  }

You could also imagine much less verbose syntaxes using custom operators. Strawman example:

  if foo < bar %% (case: .insensitive, locale: .current) { … }

I think this would make human-friendly comparisons much easier to write and understand than adding a bunch of options to a `compared(to:)` call.

This quirk aside, every aspect of strings-as-collections-of-graphemes appears to
comport perfectly with Unicode. We think the concatenation problem is tolerable,
because the cases where it occurs all represent partially-formed constructs.
...
Admitting these cases encourages exploration of grapheme composition and is
consistent with what appears to be an overall Unicode philosophy that “no
special provisions are made to get marginally better behavior for… cases that
never occur in practice.”[2]

This sounds good to me.

### Unification of Slicing Operations

I think you know what I think about this. :^)

(By the way, I've at least partially let this proposal drop for the moment because it's so dependent on generic subscripts to really be an improvement. I do plan to pick it up when those arrive; ping me then if I don't notice.)

A question, though. We currently have a couple of methods, mostly with `subrange` in their names, that can be thought of as slicing operations but aren't:

  collection.removeSubrange(i..<j)
  collection[i..<j].removeAll()
  
  collection.replaceSubrange(i..<j, with: others)
  collection[i..<j].replaceAll(with: others) // hypothetically

Should these be changed, too? Can we make them efficient (in terms of e.g. copy-on-write) if we do?

### Substrings

When implementing substring slicing, languages are faced with three options:

1. Make the substrings the same type as string, and share storage.
2. Make the substrings the same type as string, and copy storage when making the substring.
3. Make substrings a different type, with a storage copy on conversion to string.

We think number 3 is the best choice.

I agree, and I think `Substring` is the right name for it: parallel to `SubSequence`, explains where it comes from, captures the trade-offs nicely. `StringSlice` is parallel to `ArraySlice`, but it strikes me as a "foolish consistency", as the saying goes; it avoids a term of art for little reason I can see.

However, is there a reason we're talking about using a separate `Substring` type at all, instead of using `Slice<String>`? Perhaps I'm missing something, but I *think* it does everything we need here. (Of course, you could say the same thing about `ArraySlice`, and yet we have that, too.)

The downside of having two types is the inconvenience of sometimes having a
`Substring` when you need a `String`, and vice-versa. It is likely this would
be a significantly bigger problem than with `Array` and `ArraySlice`, as
slicing of `String` is such a common operation. It is especially relevant to
existing code that assumes `String` is the currency type. To ease the pain of
type mismatches, `Substring` should be a subtype of `String` in the same way
that `Int` is a subtype of `Optional<Int>`.

I've seen people struggle with the `Array`/`ArraySlice` issue when writing recursive algorithms, so personally, I'd like to see a more general solution that handles all `Collection`s.

Rather than having an implicit copying conversion from `String` to `Substring` (or `Array` to `ArraySlice`, or `Collection` to `Collection.SubSequence`), I wonder if implicitly converting in the other direction might be more useful, at least in some circumstances. Converting in this direction does *not* involve an implicit copy, merely calculating a range, so you won't have the same performance surprises. On the other hand, it's also useful in fewer situations.

(If we did go with consistently using `Slice<T>`, this might merely be a special-cased `T -> Slice<T>` conversion. One type, special-cased until we feel comfortable inventing a general mechanism.)

A user who needs to optimize away copies altogether should use this guideline:
if for performance reasons you are tempted to add a `Range` argument to your
method as well as a `String` to avoid unnecessary copies, you should instead
use `Substring`.

I do like this as a guideline, though. There's definitely room in the standard library for "a string and a range of that string to operate upon".

##### The “Empty Subscript”

To make it easy to call such an optimized API when you only have a `String` (or
to call any API that takes a `Collection`'s `SubSequence` when all you have is
the `Collection`), we propose the following “empty subscript” operation,

extension Collection {
 subscript() -> SubSequence { 
   return self[startIndex..<endIndex] 
 }
}

which allows the following usage:

funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring

That's a little bit funky, but I guess it might work.

Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported
without the `NSRange` argument. The Objective-C importer should be changed to
give these APIs special treatment so that when a `Substring` is passed, instead
of being converted to a `String`, the full `NSString` and range are passed to
the Objective-C method, thereby avoiding a copy.

As a result, you would never need to pass an `NSRange` to these APIs, which
solves the impedance problem by eliminating the argument, resulting in more
idiomatic Swift code while retaining the performance benefit. To help users
manually handle any cases that remain, Foundation should be augmented to allow
the following syntax for converting to and from `NSRange`:

let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
let iToJ = Range(nsr, in: s)    // Equivalent to i..<j

I sort of like this, but note that if we use `String` -> `Substring` conversion instead of the other way around, there's less magic needed to get this effect: `NSString, NSRange` can be imported as `Substring`, which automatically converts from `String` in exactly the manner we want it to.
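A sketch of the round trip these initializers would enable, using exactly the proposed signatures (é here is a single precomposed scalar, so it occupies one UTF-16 unit):

```swift
import Foundation

let s = "Café au lait"
let i = s.startIndex
let j = s.index(i, offsetBy: 4)
// Convert a String range to the UTF-16-based NSRange Cocoa expects…
let nsr = NSRange(i..<j, in: s)        // location 0, length 4
// …and back again, recovering the original indices.
let roundTripped = Range(nsr, in: s)!
// s[roundTripped] == "Café"
```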

Since Unicode conformance is a key feature of string processing in swift, we
call that protocol `Unicode`:

I'm sorry, I think the name is too clever by half. It sounds something like what `UnicodeCodec` actually is. Or maybe a type representing a version of the Unicode standard or something. I'd prefer something more prosaic like `StringProtocol`.

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.

I mean, sure, but then you imagine it being used generically:

  func parse<UnicodeType: Unicode>(_ source: UnicodeType) -> UnicodeType
  // which concrete types can `source` be???

We should provide convenient APIs processing strings by character. For example,
it should be easy to cleanly express, “if this string starts with `"f"`, process
the rest of the string as follows…” Swift is well-suited to expressing this
common pattern beautifully, but we need to add the APIs. Here are two examples
of the sort of code that might be possible given such APIs:

if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
 somethingWith(input) // process the rest of input
}

if let (number, restOfInput) = input.parsingPrefix(Int.self) {
  ...
}

The specific spelling and functionality of APIs like this are TBD. The larger
point is to make sure matching-and-consuming jobs are well-supported.

Yes.

#### Unified Pattern Matcher Protocol

Many of the current methods that do matching are overloaded to do the same
logical operations in different ways, with the following axes:

- Logical Operation: `find`, `split`, `replace`, match at start
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
the method name, and sometimes an argument
- Whole string or subrange.

We should represent these aspects as orthogonal, composable components,
abstracting pattern matchers into a protocol like
[this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
that can allow us to define logical operations once, without introducing
overloads, and massively reducing API surface area.

*Very* yes.

For example, using the strawman prefix `%` syntax to turn string literals into
patterns, the following pairs would all invoke the same generic methods:

if let found = s.firstMatch(%"searchString") { ... }
if let found = s.firstMatch(someRegex) { ... }

for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
for m in s.allMatches(someRegex) { ... }

let items = s.split(separatedBy: ", ")
let tokens = s.split(separatedBy: CharacterSet.whitespace)

Very, *very* yes.

If we do this, rather than your `%` operator (or whatever it becomes), I wonder if we can have these extensions:

  // Assuming a protocol like:
  protocol Pattern {
    associatedtype PatternElement
    func matches<CollectionType: Collection>(…) -> … where CollectionType.Element == PatternElement
  }
  extension Equatable: Pattern {
    typealias PatternElement = Self
    …
  }
  extension Collection: Pattern where Element: Equatable {
    typealias PatternElement = Element
  }

...although then `Collection` would conform to `Pattern` through both itself and (conditionally) `Equatable`. Hmm.

I suppose we faced this same problem elsewhere and ended up with things like:

  mutating func append(_ element: Element)
  mutating func append<Seq: Sequence>(contentsOf seq: Seq) where Seq.Iterator.Element == Element

So we could do things like:

  str.firstMatch("x") // single element, so this is a Character
  str.firstMatch(contentsOf("xy"))
  str.firstMatch(anyOf(["x", "y"] as Set))

#### Index Interchange Among Views

I really, really, really want this.

We think random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Wait, you do? Doesn't that mean either using UTF-32, inventing a UTF-24 to use, or using some kind of complicated side table that adjusts for all the multi-unit characters in a UTF-16 or UTF-8 string? None of these sound ideal.

Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
seamless by having them share an index type (semantics of indexing a `String`
between grapheme cluster boundaries are TBD—it can either trap or be forgiving).

I think it should be forgiving, and I think it should be forgiving in a very specific way: It should treat indexing in the middle of a cluster as though you indexed at the beginning.

The reason is `AttributedString`. You can think of `AttributedString` as being a type which adds additional views to a `String`; these views are indexed by `String.Index`, just like `String`, `String.UnicodeScalarView`, et al., and advancing an index with these views advances it to the beginning of the next run. But you can also just subscript these views with an arbitrary index in the middle of a run, and it'll work correctly.

I think it would be useful for this behavior to be consistent among all `String` views.
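Concretely, the rule I'm imagining looks like this (pseudocode; it relies on the shared index type, which doesn't exist yet):

  let s = "e\u{301}f" // "éf"; the first Character spans two unicode scalars
  let i = s.unicodeScalars.index(after: s.unicodeScalars.startIndex)
  // `i` points at U+0301, in the middle of the first grapheme cluster.
  // Under the forgiving rule, the Character view rounds down to the
  // start of the cluster instead of trapping:
  s[i] // "é", exactly as if we had subscripted at the cluster's start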

Having a common index allows easy traversal into the interior of graphemes,
something that is often needed, without making it likely that someone will do it
by accident.

- `String.index(after:)` should advance to the next grapheme, even when the
  index points partway through a grapheme.

- `String.index(before:)` should move to the start of the grapheme before
  the current position.

Good.

Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not
crucial, as the specifics of encoding should not be a concern for most use
cases, and would impose needless costs on the indices of other views.

I don't know about this, at least for the UTF-16 view. Here's why:

That leaves the interchange of bare indices with Cocoa APIs trafficking in
`Int`. Hopefully such APIs will be rare, but when needed, the following
extension, which would be useful for all `Collections`, can help:

extension Collection {
  func index(offset: IndexDistance) -> Index {
    return index(startIndex, offsetBy: offset)
  }
  func offset(of i: Index) -> IndexDistance {
    return distance(from: startIndex, to: i)
  }
}

Then integers can easily be translated into offsets into a `String`'s `utf16`
view for consumption by Cocoa:

let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
let swiftIndex = s.utf16.index(offset: cocoaIndex)

I worry that this conversion will be too obscure. In Objective-C, you don't really think very much about what "character" means; it's just an index that points to a location inside the string. I don't think people will know to use the `utf16` view instead of the others—especially the plain `String` version, which would be the most obvious one to use.

I think I'd prefer to see the following:

1. UTF-16 is the storage format, at least for an "ordinary" `Swift.String`.

2. `String.Index` is used down to the `UTF16View`. It stores a UTF-16 offset.

3. With just the standard library imported, `String.Index` does not have any obvious way to convert to or from an `Int` offset; you use `index(_:offsetBy:)` on one of the views. `utf16`'s implementation is just faster than the others.

4. Foundation adds `init(_:)` methods to `String.Index` and `Int`, as well as `Range<String.Index>` and `NSRange`, which perform mutual conversions:

  XCTAssertEqual(Int(String.Index(cocoaIndex)), cocoaIndex)
  XCTAssertEqual(NSRange(Range<String.Index>(cocoaRange)), cocoaRange)

I think this would really help to guide people to the right APIs for the task.

(Also, it would make my `AttributedString` thing work better, too.)

### Formatting

Briefly: I am, let's say, 95% on board with your plan to replace format strings with interpolation and format methods. The remaining 5% concern is that we'll need an adequate replacement for the ability to load a format string dynamically and have it reorder or alter the formatting of interpolated values. Obviously dynamic format strings are dangerous and limited, but where you *can* use them, they're invaluable.

#### String Interpolation

Swift string interpolation provides a user-friendly alternative to printf's
domain-specific language (just write ordinary swift code!) and its type safety
problems (put the data right where it belongs!) but the following issues prevent
it from being useful for localized formatting (among other jobs):

* [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
   types used in string interpolation.
* [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
   distinguish (fragments of) the base string from the string substitutions.

If I find some copious free time, I could try to develop proposals for one or both of these. Would there be interest in them at this point? (Feel free to contact me off-list about this, preferably in a new thread.)

(Okay, one random thought, because I can't resist: Perhaps the "\(…)" syntax can be translated directly into an `init(…)` on the type you're creating. That is, you can write:

  let x: MyString = "foo \(bar) baz \(quux, radix: 16)"

And that translates to:

  let x = MyString(stringInterpolationSegments:
    MyString(stringLiteral: "foo "),
    MyString(bar),
    MyString(stringLiteral: " baz "),
    MyString(quux, radix: 16)
  )

That would require you to redeclare `String` initializers on your own string type, but you probably need some of your own logic anyway, right?)

In the long run, we should improve Swift string interpolation to the point where
it can participate in most any formatting job. Mostly this centers around
fixing the interpolation protocols per the previous item, and supporting
localization.

For what it's worth, by using a hacky workaround for SR-1260, I've written (Swift 2.0) code that passes strings with interpolations through the Foundation localized string tables: <https://gist.github.com/brentdax/79fa038c0af0cafb52dd> Obviously that's just a start, but it is incredibly convenient.

### C String Interop

Our support for interoperation with nul-terminated C strings is scattered and
incoherent, with 6 ways to transform a C string into a `String` and four ways to
do the inverse. These APIs should be replaced with the following

These APIs are much better than the status quo, but it's a shame that we can't have them handle non-nul-terminated data, too.

Actually... (Begin shaggy dog story...)

Suppose you introduce an `UnsafeNulTerminatedBufferPointer` type. Then you could write a *very* high-level API which handles pretty much every conversion under the sun:

  extension String {
    /// Constructs a `String` from a sequence of `codeUnits` in an indicated `encoding`.
    ///
    /// - Parameter codeUnits: A sequence of code units in the given `encoding`.
    /// - Parameter encoding: The encoding the code units are in.
    init<CodeUnits: Sequence, Encoding: UnicodeEncoding>(_ codeUnits: CodeUnits, encoding: Encoding.Type)
      where CodeUnits.Iterator.Element == Encoding.CodeUnit
  }

For UTF-8, at least, that would cover reading from `Array`, `UnsafeBufferPointer`, `UnsafeRawBufferPointer`, `UnsafeNulTerminatedBufferPointer`, `Data`, you name it. Maybe we could have a second one that always takes something producing bytes, no matter the encoding used:

  extension String {
    /// Constructs a `String` from the code units contained in `bytes` in a given `encoding`.
    ///
    /// - Parameter bytes: A sequence of bytes expressing code units in the given `encoding`.
    /// - Parameter encoding: The encoding the code units are in.
    init<Bytes: Sequence, Encoding: UnicodeEncoding>(_ bytes: Bytes, encoding: Encoding.Type)
      where Bytes.Iterator.Element == UInt8
  }

These two initializers would replace...um, something like eight existing ones, including ones from Foundation. On the other hand, this is *very* generic. And, unless we actually changed the way `char *` imported to `UnsafeNulTerminatedBufferPointer<CChar>`, the C string call sequence would be pretty complicated:

  String(UnsafeNulTerminatedBufferPointer(start: cString), encoding: UTF8.self)

So you might end up having to wrap it in an `init(cString:)` anyway, just for convenience. Oh well, it was worth exploring.

Prototype of the above: https://gist.github.com/brentdax/8b71f46b424dc64abaa77f18556e607b

(Hmm...maybe bridge `char *` to a type like this instead?

  struct CCharPointer {
    var baseAddress: UnsafePointer<CChar> { get }
    var nulTerminated: UnsafeNulTerminatedBufferPointer<CChar> { get }
    func ofLength(_ length: Int) -> UnsafeBufferPointer<CChar>
  }

Nah, probably not gonna happen...)

init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

By the way, I just noticed an impedance mismatch in current Swift: `CChar` is usually an `Int8`, but `UnicodeScalar` and `UTF8` currently want `UInt8`. It'd be nice to address this somehow, if only by adding some signed variants or something.
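For now, the workaround for that mismatch is memory rebinding. A small sketch using the existing `withMemoryRebound(to:capacity:_:)` API (the sample data is mine):

```swift
let chars: [CChar] = [104, 105, 0] // "hi" as a nul-terminated C string (Int8)
let s = chars.withUnsafeBufferPointer { buf -> String in
  // UTF8 and UnicodeScalar traffic in UInt8, so rebind the Int8 memory:
  buf.baseAddress!.withMemoryRebound(to: UInt8.self, capacity: chars.count) {
    String(cString: $0) // the UInt8 overload
  }
}
// s == "hi"
```

It works, but having to rebind just to change signedness is exactly the sort of friction I'd like to see designed away.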

### High-Performance String Processing

Many strings are short enough to store in 64 bits, many can be stored using only
8 bits per unicode scalar, others are best encoded in UTF-16, and some come to
us already in some other encoding, such as UTF-8, that would be costly to
translate. Supporting these formats while maintaining usability for
general-purpose APIs demands that a single `String` type can be backed by many
different representations.

Just putting a pin in this, because I'll want to discuss it a little later.

### Parsing ASCII Structure

Although many machine-readable formats support the inclusion of arbitrary
Unicode text, it is also common that their fundamental structure lies entirely
within the ASCII subset (JSON, YAML, many XML formats). These formats are often
processed most efficiently by recognizing ASCII structural elements as ASCII,
and capturing the arbitrary sections between them in more-general strings. The
current String API offers no way to efficiently recognize ASCII and skip past
everything else without the overhead of full decoding into unicode scalars.

For these purposes, strings should supply an `extendedASCII` view that is a
collection of `UInt32`, where values less than `0x80` represent the
corresponding ASCII character, and other values represent data that is specific
to the underlying encoding of the string.

This sounds interesting, but:

1. It doesn't sound like you anticipate there being any way to compare an element of the `extendedASCII` view to a character literal. That seems like it'd be really useful.

2. I don't really understand how you envision using the "data specific to the underlying encoding" sections. Presumably you'll want to convert that data into a string eventually, right?

Do you have pseudocode or something lying around that might help us understand how you think this might be used?
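For instance, is something like this what you have in mind? (Entirely invented on my part; `handleStructuralCharacter` and `captureRawCodeUnit` are placeholders:)

  for unit in json.extendedASCII {
    switch unit {
    case 0x7B /* "{" */, 0x5B /* "[" */, 0x2C /* "," */, 0x22 /* quote */:
      // Structural elements are recognized directly as ASCII...
      handleStructuralCharacter(unit)
    default:
      // ...while other values, including the encoding-specific ones >= 0x80,
      // are captured without decoding and turned back into a String lazily.
      captureRawCodeUnit(unit)
    }
  }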

### Do we need a type-erasable base protocol for UnicodeEncoding?

UnicodeEncoding has an associated type, but it may be important to be able to
traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”

As long as you're here, we haven't talked about `UnicodeEncoding` much. I assume this is a slightly modified version of `UnicodeCodec`? Anything to say about it?

If it *is* similar to `UnicodeCodec`, one thing I will note is that the way `UnicodeCodec` works in code units is rather annoying for I/O. It may make sense to have some sort of type-erasing wrapper around `UnicodeCodec` which always uses bytes. (You then have to worry about endianness, of course...)

### Should there be a string “facade?”

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

struct String<U: Unicode = StringStorage>
  : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>

I think this is a very, very interesting idea. A few notes:

* Earlier, I said I didn't like `Unicode` as a protocol name. If we go this route, I think `StringStorage` is a good name for that protocol. The default storage might be something like `UTF16StringStorage`, or just, you know, `DefaultStringStorage`.

* Earlier, you mentioned the tension between using multiple representations for flexibility and pinning down one representation for speed. One way to handle this might be to have `String`'s default `StringStorage` be a superclass or type-erased wrapper or something. That way, if you just write `String`, you get something flexible; if you write `String<NFCNormalizedUTF16StringStorage>`, you get something fast.

* Could `NSString` be a `StringStorage`, or support a trivial wrapper that converts it into a `StringStorage`? Would that be helpful at all?

* If we do this, does `String.Index` become a type-specific thing? That is, might `String<UTF8Storage>.Index` be different from `String<UTF16Storage>.Index`? What does that mean for `String.Index` unification?

### `description` and `debugDescription`

* Should these be creating localized or non-localized representations?

`debugDescription`, I think, is non-localized; it's something helpful for the programmer, and the programmer's language is not the user's. It's also usually something you don't want to put *too* much effort into, other than to dump a lot of data about the instance.

`description` would have to change to be localizable. (Specifically, it would have to take a locale.) This is doable, of course, but it hasn't been done yet.

* Is returning a `String` efficient enough?

I'm not sure how important efficiency is for `description`, honestly.

* Is `debugDescription` pulling the weight of the API surface area it adds?

Maybe? Or maybe it's better off as part of the `Mirror` instead of a property on the instance itself.

### `StaticString`

`StaticString` was added as a byproduct of standard library development and kept
around because it seemed useful, but it was never truly *designed* for client
programmers. We need to decide what happens with it. Presumably *something*
should fill its role, and that should conform to `Unicode`.

Maybe. One complication there is that `Unicode` presumably supports mutation, which `StaticString` doesn't.

Another possibility I've discussed in the past is renaming `StaticString` to `StringLiteral` and using it largely as a way to initialize `String`. (I mentioned that in a thread about the need for public integer and floating-point literal types that are more expressive now that we're supporting larger integer/float types.) It could have just enough API surface to access it as a buffer of UTF-8 bytes and thereby build a `String` or `Data` from it.

Well, that's it for this massive email. You guys are doing a hell of a job on this.

Hope this helps,

···

On Jan 19, 2017, at 6:56 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

--
Brent Royal-Gordon
Architechies


(Félix Cloutier) #16

I read through the manifesto and most of the discussion here. I'd like to bring two points that I haven't seen discussed.

First, the fact that the Unicode standard says that a string starting with an isolated combining character is "degenerate" or "defective" doesn't necessarily mean that ignoring that case is the right thing to do. In fact, it means that Unicode won't do anything to protect programs against these, and if Swift doesn't, chances are that no one will. Isolated combining characters break a number of expectations that developers could reasonably have:

(a + b).count == a.count + b.count
(a + b).startsWith(a)
(a + b).endsWith(b)
(a + b).find(a) // or .find(b)

Of course, this can be documented, but people want easy, and documentation is hard.
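To make the first expectation concrete: with `a = ","` and `b` a lone combining acute, the acute combines with the comma, so the counts don't add up (checked with the character-level `count`):

```swift
let a = ","
let b = "\u{0301}" // combining acute accent, with no base character
let combined = a + b // the acute combines with the comma: ",́"

// The "obvious" arithmetic does not hold:
let lhs = combined.count    // 1 (a single grapheme cluster)
let rhs = a.count + b.count // 2
```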

One not particularly contrived example of a vulnerability that this could cause would be a website that encodes session information in a cookie with comma-separated values (encrypted and authenticated and all, like RoR does), inside of which the user has indirect control over at least one textual field (like a username). If the serializing code looks like `"\(id),\(email),\(username)"`, and an attacker's username starts with a combining character, they could cause the delimiting comma to be combined into something different (like this acute comma: ,́) and not recognized by the parsing code, or worse, misinterpreted by the parsing code.

To be safe, you'd have to either find out that the string starts with a combining character (the only easy way to find out right now seems to be to append the string to something else), or have a sentinel character, neither of which is obvious.

I recognize that this is how things are currently handled anyway ("," + "\u{0301}" gives you an acute comma with Swift 3.0.2), but as this is currently on the table, I figured that I should bring it up. This possibly matters more now that it has been decided that Strings are meant to represent machine-oriented text.

My second concern is with how easy it is to convert an Int to a String index. I've been vocal about this before: I'm concerned that Swift developers will treat Ints as random-access indices into Strings, which they emphatically are not. String.Index(100) is proposed as a constant-time operation that *usually* gives you an index to the 100th character of a string, whereas the sanctioned (and *always* correct) way is cumbersome and linear-time. I can easily see people mistake it for a feature and disregard internationalization (or worse: emojis! gasp!) for what they perceive to be a free performance improvement. As a person cursed with an acute accent in my first name, I'm very sensitive to this.

Félix

···

Le 19 janv. 2017 à 18:56, Ben Cohen via swift-evolution <swift-evolution@swift.org> a écrit :

Hi all,

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

We’re eager to hear everyone’s thoughts.

Regards,
Ben and Dave

# String Processing For Swift 4

* Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](https://github.com/airspeedswift)

The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus
far, with just this short blurb in the
[list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):

**String re-evaluation**: String is one of the most important fundamental
types in the language. The standard library leads have numerous ideas of how
to improve the programming model for it, without jeopardizing the goals of
providing a unicode-correct-by-default model. Our goal is to be better at
string processing than Perl!

For Swift 4 and beyond we want to improve three dimensions of text processing:

1. Ergonomics
2. Correctness
3. Performance

This document is meant to both provide a sense of the long-term vision
(including undecided issues and possible approaches), and to define the scope of
work that could be done in the Swift 4 timeframe.

## General Principles

### Ergonomics

It's worth noting that ergonomics and correctness are mutually-reinforcing. An
API that is easy to use—but incorrectly—cannot be considered an ergonomic
success. Conversely, an API that's simply hard to use is also hard to use
correctly. Achieving optimal performance without compromising ergonomics or
correctness is a greater challenge.

Consistency with the Swift language and idioms is also important for
ergonomics. There are several places both in the standard library and in the
foundation additions to `String` where patterns and practices found elsewhere
could be applied to improve usability and familiarity.

### API Surface Area

Primary data types such as `String` should have APIs that are easily understood
given a signature and a one-line summary. Today, `String` fails that test. As
you can see, the Standard Library and Foundation both contribute significantly to
its overall complexity.

**Method Arity** | **Standard Library** | **Foundation**
---|:---:|:---:
0: `ƒ()` | 5 | 7
1: `ƒ(:)` | 19 | 48
2: `ƒ(::)` | 13 | 19
3: `ƒ(:::)` | 5 | 11
4: `ƒ(::::)` | 1 | 7
5: `ƒ(:::::)` | - | 2
6: `ƒ(::::::)` | - | 1

**API Kind** | **Standard Library** | **Foundation**
---|:---:|:---:
`init` | 41 | 18
`func` | 42 | 55
`subscript` | 9 | 0
`var` | 26 | 14

**Total: 205 APIs**

By contrast, `Int` has 80 APIs, none with more than two parameters.[0] String processing is complex enough; users shouldn't have
to press through physical API sprawl just to get started.

Many of the choices detailed below contribute to solving this problem,
including:

* Restoring `Collection` conformance and dropping the `.characters` view.
* Providing a more general, composable slicing syntax.
* Altering `Comparable` so that parameterized
   (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
* Clearly separating language-dependent operations on text produced
   by and for humans from language-independent
   operations on text produced by and for machine processing.
* Relocating APIs that fall outside the domain of basic string processing and
   discouraging the proliferation of ad-hoc extensions.

### Batteries Included

While `String` is available to all programs out-of-the-box, crucial APIs for
basic string processing tasks are still inaccessible until `Foundation` is
imported. While it makes sense that `Foundation` is needed for domain-specific
jobs such as
[linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
one should not need to import anything to, for example, do case-insensitive
comparison.

### Unicode Compliance and Platform Support

The Unicode standard provides a crucial objective reference point for what
constitutes correct behavior in an extremely complex domain, so
Unicode-correctness is, and will remain, a fundamental design principle behind
Swift's `String`. That said, the Unicode standard is an evolving document, so
this objective reference-point is not fixed.[1] While
many of the most important operations—e.g. string hashing, equality, and
non-localized comparison—will be stable, the semantics
of others, such as grapheme breaking and localized comparison and case
conversion, are expected to change as platforms are updated, so programs should
be written so their correctness does not depend on precise stability of these
semantics across OS versions or platforms. Although it may be possible to
imagine static and/or dynamic analysis tools that will help users find such
errors, the only sure way to deal with this fact of life is to educate users.

## Design Points

### Internationalization

There is strong evidence that developers cannot determine how to use
internationalization APIs correctly. Although documentation could and should be
improved, the sheer size, complexity, and diversity of these APIs is a major
contributor to the problem, causing novices to tune out, and more experienced
programmers to make avoidable mistakes.

The first step in improving this situation is to regularize all localized
operations as invocations of normal string operations with extra
parameters. Among other things, this means:

1. Doing away with `localizedXXX` methods
2. Providing a terse way to name the current locale as a parameter
3. Automatically adjusting defaults for options such
  as case sensitivity based on whether the operation is localized.
4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
   guidance in the
   [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).

Along with appropriate documentation updates, these changes will make localized
operations more teachable, comprehensible, and approachable, thereby lowering a
barrier that currently leads some developers to ignore localization issues
altogether.

#### The Default Behavior of `String`

Although this isn't well-known, the most accessible form of many operations on
Swift `String` (and `NSString`) are really only appropriate for text that is
intended to be processed for, and consumed by, machines. The semantics of the
operations with the simplest spellings are always non-localized and
language-agnostic.

Two major factors play into this design choice:

1. Machine processing of text is important, so we should have first-class,
  accessible functions appropriate to that use case.

2. The most general localized operations require a locale parameter not required
  by their un-localized counterparts. This naturally skews complexity towards
  localized operations.

Reaffirming that `String`'s simplest APIs have
language-independent/machine-processed semantics has the benefit of clarifying
the proper default behavior of operations such as comparison, and allows us to
make [significant optimizations](#collation-semantics) that were previously
thought to conflict with Unicode.

#### Future Directions

One of the most common internationalization errors is the unintentional
presentation to users of text that has not been localized, but regularizing APIs
and improving documentation can go only so far in preventing this error.
Combined with the fact that `String` operations are non-localized by default,
the environment for processing human-readable text may still be somewhat
error-prone in Swift 4.

For an audience of mostly non-experts, it is especially important that naïve
code is very likely to be correct if it compiles, and that more sophisticated
issues can be revealed progressively. For this reason, we intend to
specifically and separately target localization and internationalization
problems in the Swift 5 timeframe.

### Operations With Options

There are three categories of common string operation that commonly need to be
tuned in various dimensions:

**Operation**|**Applicable Options**
---|---
sort ordering | locale, case/diacritic/width-insensitivity
case conversion | locale
pattern matching | locale, case/diacritic/width-insensitivity

The defaults for case-, diacritic-, and width-insensitivity are different for
localized operations than for non-localized operations, so for example a
localized sort should be case-insensitive by default, and a non-localized sort
should be case-sensitive by default. We propose a standard “language” of
defaulted parameters to be used for these purposes, with usage roughly like this:

 x.compared(to: y, case: .sensitive, in: swissGerman)

 x.lowercased(in: .currentLocale)

 x.allMatches(
   somePattern, case: .insensitive, diacritic: .insensitive)

This usage might be supported by code like this:

enum StringSensitivity {
  case sensitive
  case insensitive
}

extension Locale {
  static var currentLocale: Locale { ... }
}

extension Unicode {
  // An example of the option language in declaration context,
  // with nil defaults indicating unspecified, so defaults can be
  // driven by the presence/absence of a specific Locale
  func frobnicated(
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> Self { ... }
}

### Comparing and Hashing Strings

#### Collation Semantics

What Unicode says about collation—which is used in `<`, `==`, and hashing—turns
out to be quite interesting, once you pick it apart. The full Unicode Collation
Algorithm (UCA) works like this:

1. Fully normalize both strings
2. Convert each string to a sequence of numeric triples to form a collation key
3. “Flatten” the key by concatenating the sequence of first elements to the
  sequence of second elements to the sequence of third elements
4. Lexicographically compare the flattened keys

While step 1 can usually
be [done quickly](http://unicode.org/reports/tr15/#Description_Norm) and
incrementally, step 2 uses a collation table that maps matching *sequences* of
unicode scalars in the normalized string to *sequences* of triples, which get
accumulated into a collation key. Predictably, this is where the real costs
lie.
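A toy illustration of steps 3 and 4 (the weights here are invented, not real collation table entries):

```swift
// Each normalized scalar sequence maps to one or more weight triples.
typealias CollationTriple = (primary: UInt16, secondary: UInt16, tertiary: UInt16)

// Step 3: concatenate all first elements, then all seconds, then all thirds.
func flattened(_ key: [CollationTriple]) -> [UInt16] {
  return key.map { $0.primary } + key.map { $0.secondary } + key.map { $0.tertiary }
}

// Step 4: compare the flattened keys lexicographically.
let x = flattened([(0x29, 0x05, 0x05), (0x2A, 0x05, 0x05)])
let y = flattened([(0x2A, 0x05, 0x05)])
let xPrecedesY = x.lexicographicallyPrecedes(y) // differs at the first primary
```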

*However*, there are some bright spots to this story. First, as it turns out,
string sorting (localized or not) should be done down to what's called
the
[“identical” level](http://unicode.org/reports/tr10/#Multi_Level_Comparison),
which adds a step 3a: append the string's normalized form to the flattened
collation key. At first blush this just adds work, but consider what it does
for equality: two strings that normalize the same, naturally, will collate the
same. But also, *strings that normalize differently will always collate
differently*. In other words, for equality, it is sufficient to compare the
strings' normalized forms and see if they are the same. We can therefore
entirely skip the expensive part of collation for equality comparison.
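Swift's `String` equality already behaves this way: it compares canonical (normalized) forms, so precomposed and decomposed spellings are equal even though their stored code units differ:

```swift
let precomposed = "caf\u{E9}"  // "café" with U+00E9
let decomposed = "cafe\u{301}" // "café" as "e" + combining acute

// The two normalize identically, so they compare equal...
let equal = precomposed == decomposed
// ...even though code-unit-for-code-unit they differ, showing the
// comparison really is at the normalized level, not the storage level.
let sameUTF16 = Array(precomposed.utf16) == Array(decomposed.utf16)
```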

Next, naturally, anything that applies to equality also applies to hashing: it
is sufficient to hash the string's normalized form, bypassing collation keys.
This should provide significant speedups over the current implementation.
Perhaps more importantly, since comparison down to the “identical” level applies
even to localized strings, it means that hashing and equality can be implemented
exactly the same way for localized and non-localized text, and hash tables with
localized keys will remain valid across current-locale changes.

Finally, once it is agreed that the *default* role for `String` is to handle
machine-generated and machine-readable text, the default ordering of `String`s
need no longer use the UCA at all. It is sufficient to order them in any way
that's consistent with equality, so `String` ordering can simply be a
lexicographical comparison of normalized forms[4]
(which is equivalent to lexicographically comparing the sequences of grapheme
clusters), again bypassing step 2 and offering another speedup.

This leaves us executing the full UCA *only* for localized sorting, and ICU's
implementation has apparently been very well optimized.

Following this scheme everywhere would also allow us to make sorting behavior
consistent across platforms. Currently, we sort `String` according to the UCA,
except that—*only on Apple platforms*—pairs of ASCII characters are ordered by
unicode scalar value.

#### Syntax

Because the current `Comparable` protocol expresses all comparisons with binary
operators, string comparisons—which may require
additional [options](#operations-with-options)—do not fit smoothly into the
existing syntax. At the same time, we'd like to solve other problems with
comparison, as outlined
in
[this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
(implemented by changes at the head
of
[this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
We should adopt a modification of that proposal that uses a method rather than
an operator `<=>`:

```swift
enum SortOrder { case before, same, after }

protocol Comparable : Equatable {
  func compared(to: Self) -> SortOrder
  ...
}
```

This change will give us a syntactic platform on which to implement methods with
additional, defaulted arguments, thereby unifying and regularizing comparison
across the library.

```swift
extension String {
  func compared(to: Self) -> SortOrder
}
```

**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible
that the standard library simply adopts Foundation's `ComparisonResult` as is,
but we believe the community should at least consider alternate naming before
that happens. There will be an opportunity to discuss the choices in detail
when the modified
[Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes
up for review.
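
As a rough sketch of how the proposed syntax might hang together (neither
`SortOrder` nor `compared(to:)` exists in the standard library today; the
implementation below is purely illustrative, using the current `<` ordering):

```swift
// Illustrative only: the proposed SortOrder enum, with compared(to:)
// defined in terms of the existing Comparable operators.
enum SortOrder { case before, same, after }

extension String {
    func compared(to other: String) -> SortOrder {
        if self == other { return .same }
        return self < other ? .before : .after
    }
}

assert("apple".compared(to: "banana") == .before)
assert("same".compared(to: "same") == .same)
```

Defaulted option arguments (case sensitivity, locale, and so on) would then
attach naturally to this method rather than to an operator.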

### `String` should be a `Collection` of `Character`s Again

In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
convinced ourselves that its semantics differed from those of `Collection` too
significantly.

It was always well understood that if strings were treated as sequences of
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`,
and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was
a collection of `Character` (extended grapheme clusters). During 2.0
development, though, we realized that correct string concatenation could
occasionally merge distinct grapheme clusters at the start and end of combined
strings.

This quirk aside, every aspect of strings-as-collections-of-graphemes appears to
comport perfectly with Unicode. We think the concatenation problem is tolerable,
because the cases where it occurs all represent partially-formed constructs. The
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE
ACCENT)—are explicitly called out in the Unicode standard as
“[degenerate](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)” or
“[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The other
cases—such as a string ending in a zero-width joiner or half of a regional
indicator—appear to be equally transient and unlikely outside of a text editor.
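
The merging behavior is easy to observe directly (written in Swift 4 syntax,
where `count` counts `Character`s):

```swift
// Concatenating a base character with an isolated combining mark merges
// them into a single grapheme cluster: lengths do not simply add.
let base = "e"               // 1 Character
let accent = "\u{301}"       // COMBINING ACUTE ACCENT; degenerate on its own
let combined = base + accent // "é": still 1 Character

assert(base.count + accent.count == 2)
assert(combined.count == 1)
```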

Admitting these cases encourages exploration of grapheme composition and is
consistent with what appears to be an overall Unicode philosophy that “no
special provisions are made to get marginally better behavior for… cases that
never occur in practice.”[2] Furthermore, it seems
unlikely to disturb the semantics of any plausible algorithms. We can handle
these cases by documenting them, explicitly stating that the elements of a
`String` are an emergent property based on Unicode rules.

The benefits of restoring `Collection` conformance are substantial:

* Collection-like operations encourage experimentation with strings to
   investigate and understand their behavior. This is useful for teaching new
   programmers, but also good for experienced programmers who want to
   understand more about strings/unicode.

* Extended grapheme clusters form a natural element boundary for Unicode
   strings. For example, searching and matching operations will always produce
   results that line up on grapheme cluster boundaries.

* Character-by-character processing is a legitimate thing to do in many real
   use-cases, including parsing, pattern matching, and language-specific
   transformations such as transliteration.

* `Collection` conformance makes a wide variety of powerful operations
   available that are appropriate to `String`'s default role as the vehicle for
   machine processed text.

   The methods `String` would inherit from `Collection` that parallel
   higher-level string algorithms have the right semantics. For example,
   grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of
   `flatMap` with case-conversion, produce the same results one would expect
   from whole-string ordering comparison, equality comparison, and
   case-conversion, respectively. `reverse` operates correctly on graphemes,
   keeping diacritics moored to their base characters and leaving emoji intact.
   Other methods such as `indexOf` and `contains` make obvious sense. A few
   `Collection` methods, like `min` and `max`, may not be particularly useful
   on `String`, but we don't consider that to be a problem worth solving, in
   the same way that we wouldn't try to suppress `min` and `max` on a
   `Set<[UInt8]>` that was used to store IP addresses.

* Many of the higher-level operations that we want to provide for `String`s,
   such as parsing and pattern matching, should apply to any `Collection`, and
   many of the benefits we want for `Collections`, such
   as unified slicing, should accrue
   equally to `String`. Making `String` part of the same protocol hierarchy
   allows us to write these operations once and not worry about keeping the
   benefits in sync.

* Slicing strings into substrings is a crucial part of the vocabulary of
   string processing, and all other sliceable things are `Collection`s.
   Because of its collection-like behavior, users naturally think of `String`
   in collection terms, but run into frustrating limitations where it fails to
   conform and are left to wonder where all the differences lie. Many simply
   “correct” this limitation by declaring a trivial conformance:

   ```swift
   extension String : BidirectionalCollection {}
   ```

   Even if we removed indexing-by-element from `String`, users could still do
   this:

   ```swift
   extension String : BidirectionalCollection {
     subscript(i: Index) -> Character { return characters[i] }
   }
   ```

   It would be much better to legitimize the conformance to `Collection` and
   simply document the oddity of any concatenation corner-cases, than to deny
   users the benefits on the grounds that a few cases are confusing.

Note that the fact that `String` is a collection of graphemes does *not* mean
that string operations will necessarily have to do grapheme boundary
recognition. See the Unicode protocol section for details.
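
The restored conformance (as it later shipped in Swift 4) makes these claims
easy to check directly:

```swift
// Generic Collection algorithms operate grapheme-by-grapheme.
let s = "re\u{301}sume\u{301}"   // "résumé" spelled with combining accents

// reversed() keeps each accent moored to its base character.
assert(String(s.reversed()) == "e\u{301}muse\u{301}r")

// Element-wise algorithms line up on grapheme boundaries, and Character
// equality is canonical, so the precomposed spelling matches too.
assert(s.elementsEqual("résumé"))
assert(s.contains("\u{E9}" as Character))
```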

### `Character` and `CharacterSet`

`Character`, which represents a
Unicode
[extended grapheme cluster](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries),
is a bit of a black box, requiring conversion to `String` in order to
do any introspection, including interoperation with ASCII. To fix this, we should:

- Add a `unicodeScalars` view much like `String`'s, so that the sub-structure
  of grapheme clusters is discoverable.
- Add a failable `init` from sequences of scalars (returning nil for sequences
  that contain 0 or 2+ graphemes).
- (Lower priority) expose some operations, such as `func uppercase() ->
  String`, `var isASCII: Bool`, and, to the extent they can be sensibly
  generalized, queries of unicode properties that should also be exposed on
  `UnicodeScalar`, such as `isAlphabetic` and `isGraphemeBase`.

Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar`
type. This means it is usable on `String`, but only by going through the unicode
scalar view. To deal with this clash in the short term, `CharacterSet` should be
renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to
introduce a `CharacterSet` that provides similar functionality for extended
grapheme clusters.[5]
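
Until such APIs exist, introspection requires the `String` round-trip
described above. The sketch below shows the workaround and approximates the
proposed failable init: a scalar sequence forms a valid `Character` exactly
when the `String` built from it contains a single grapheme cluster.

```swift
// A flag is one Character built from two regional-indicator scalars.
let flag: Character = "🇺🇸"

// Workaround today: convert to String to reach the scalar sub-structure.
let scalars = Array(String(flag).unicodeScalars)
assert(scalars.count == 2)

// Rebuild a String from the scalars and check it is one grapheme cluster,
// approximating the proposed failable Character init.
var view = String.UnicodeScalarView()
view.append(contentsOf: scalars)
let rebuilt = String(view)
assert(rebuilt.count == 1)
```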

### Unification of Slicing Operations

Creating substrings is a basic part of String processing, but the slicing
operations that we have in Swift are inconsistent in both their spelling and
their naming:

* Slices with two explicit endpoints are done with subscript, and support
   in-place mutation:

       s[i..<j].mutate()

* Slicing from an index to the end, or from the start to an index, is done
   with a method and does not support in-place mutation:

       s.prefix(upTo: i).readOnly()

Prefix and suffix operations should be migrated to subscripting operations
with one-sided ranges; i.e., `s.prefix(upTo: i)` should become `s[..<i]`, as
in
[this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
With generic subscripting in the language, that will allow us to collapse a wide
variety of methods and subscript overloads into a single implementation, and
give users an easy-to-use and composable way to describe subranges.

Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 5)`
is an ongoing research project that can be considered part of the potential
long-term vision of text (and collection) processing.
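
The one-sided forms were later accepted (as SE-0172) and ship in Swift 4,
where the migration reads as follows:

```swift
// One-sided range subscripts unify prefix/suffix operations with
// ordinary two-endpoint slicing.
let s = "Hello, world"
let comma = s.firstIndex(of: ",")!

assert(s[..<comma] == "Hello")                   // was s.prefix(upTo: comma)
assert(s[s.index(after: comma)...] == " world")  // was s.suffix(from:)
```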

### Substrings

When implementing substring slicing, languages are faced with three options:

1. Make the substrings the same type as string, and share storage.
2. Make the substrings the same type as string, and copy storage when making the substring.
3. Make substrings a different type, with a storage copy on conversion to string.

We think number 3 is the best choice. A walk-through of the tradeoffs follows.

#### Same type, shared storage

In Swift 3.0, slicing a `String` produces a new `String` that is a view into a
subrange of the original `String`'s storage. This is why `String` is 3 words in
size (the start, length and buffer owner), unlike the similar `Array` type
which is only one.

This is a simple model with big efficiency gains when chopping up strings into
multiple smaller strings. But it does mean that a stored substring keeps the
entire original string buffer alive even after it would normally have been
released.

This arrangement has proven to be problematic in other programming languages,
because applications sometimes extract small strings from large ones and keep
those small strings long-term. That is considered a memory leak and was enough
of a problem in Java that, in version 1.7, substrings were changed from sharing
storage to making a copy.

#### Same type, copied storage

Copying of substrings is also the choice made in C#, and in the default
`NSString` implementation. This approach avoids the memory leak issue, but has
obvious performance overhead in performing the copies.

This in turn encourages trafficking in string/range pairs instead of in
substrings, for performance reasons, leading to API challenges. For example:

```swift
foo.compare(bar, range: start..<end)
```

Here, it is not clear whether `range` applies to `foo` or `bar`. This
relationship is better expressed in Swift as a slicing operation:

```swift
foo[start..<end].compare(bar)
```

Not only does this clarify to which string the range applies, it also brings
this sub-range capability to any API that operates on `String` "for free". So
these other combinations also work equally well:

```swift
// apply range on argument rather than target
foo.compare(bar[start..<end])
// apply range on both
foo[start..<end].compare(bar[start1..<end1])
// compare two strings ignoring first character
foo.dropFirst().compare(bar.dropFirst())
```

In all three cases, an explicit range argument need not appear on the `compare`
method itself. The implementation of `compare` does not need to know anything
about ranges. Methods need only take range arguments when that was an
integral part of their purpose (for example, setting the start and end of a
user's current selection in a text box).

#### Different type, shared storage

The desire to share underlying storage while preventing accidental memory leaks
occurs with slices of `Array`. For this reason we have an `ArraySlice` type.
The inconvenience of a separate type is mitigated by most operations used on
`Array` from the standard library being generic over `Sequence` or `Collection`.

We should apply the same approach for `String` by introducing a distinct
`SubSequence` type, `Substring`. The advice given for `ArraySlice` would apply
equally to `Substring`:

> Important: Long-term storage of `Substring` instances is discouraged. A
> substring holds a reference to the entire storage of a larger string, not
> just to the portion it presents, even after the original string's lifetime
> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
> of large strings that are no longer otherwise accessible, which can appear
> to be memory leakage.

When assigning a `Substring` to a longer-lived variable (usually a stored
property) explicitly of type `String`, a type conversion will be performed, and
at this point the substring buffer is copied and the original string's storage
can be released.

A `String` that was not its own `Substring` could be one word—a single tagged
pointer—without requiring additional allocations. A `Substring` would be a view
onto a `String`, and so takes 3 words: a pointer to the owner, a pointer to the
start, and a length. The small string optimization for `Substring` would take
advantage of the larger size, probably with a less compressed encoding for
speed.

The downside of having two types is the inconvenience of sometimes having a
`Substring` when you need a `String`, and vice-versa. It is likely this would
be a significantly bigger problem than with `Array` and `ArraySlice`, as
slicing of `String` is such a common operation. It is especially relevant to
existing code that assumes `String` is the currency type. To ease the pain of
type mismatches, `Substring` should be a subtype of `String` in the same way
that `Int` is a subtype of `Optional<Int>`. This would give users an implicit
conversion from `Substring` to `String`, as well as the usual implicit
conversions such as `[Substring]` to `[String]` that other subtype
relationships receive.
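
Note that the implicit `Substring`-to-`String` subtype conversion sketched
here is a proposal; as Swift 4 actually shipped, the conversion is written
explicitly with `String.init`:

```swift
// Slicing shares storage: `piece` keeps bigString's entire buffer alive.
let bigString = String(repeating: "a", count: 10_000) + ",b"
let piece: Substring = bigString.split(separator: ",")[1]

// Copying into a String releases the hold on the large buffer, making the
// result safe for long-term storage.
let keeper = String(piece)
assert(keeper == "b")
```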

In most cases, type inference combined with the subtype relationship should
make the type difference a non-issue and users will not care which type they
are using. For flexibility and optimizability, most operations from the
standard library will traffic in generic models of
[`Unicode`](#the--code-unicode--code--protocol).

##### Guidance for API Designers

In this model, **if a user is unsure about which type to use, `String` is always
a reasonable default**. A `Substring` passed where `String` is expected will be
implicitly copied. When compared to the “same type, copied storage” model, we
have effectively deferred the cost of copying from the point where a substring
is created until it must be converted to `String` for use with an API.

A user who needs to optimize away copies altogether should use this guideline:
if for performance reasons you are tempted to add a `Range` argument to your
method as well as a `String` to avoid unnecessary copies, you should instead
use `Substring`.
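
A minimal sketch of the guideline, using a hypothetical `parseField` helper:
taking `Substring` lets callers pass slices without copying or passing a
separate `Range`.

```swift
// Accept Substring instead of a (String, Range) pair.
func parseField(_ field: Substring) -> Int? {
    return Int(field)   // Int.init works on any StringProtocol
}

let record = "42,17,99"
let fields = record.split(separator: ",")   // [Substring]: no copies made
let values = fields.compactMap(parseField)
assert(values == [42, 17, 99])
```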

##### The “Empty Subscript”

To make it easy to call such an optimized API when you only have a `String` (or
to call any API that takes a `Collection`'s `SubSequence` when all you have is
the `Collection`), we propose the following “empty subscript” operation,

```swift
extension Collection {
  subscript() -> SubSequence {
    return self[startIndex..<endIndex]
  }
}
```

which allows the following usage:

```swift
funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring
```

The `[]` syntax can be offered as a fixit when needed, similar to `&` for an
`inout` argument. While it doesn't help a user to convert `[String]` to
`[Substring]`, the need for such conversions is extremely rare, and can be done
with a simple `map` (which could also be offered by a fixit):

```swift
takesAnArrayOfSubstring(arrayOfString.map { $0[] })
```

#### Other Options Considered

As we have seen, all three options above have downsides, but it's possible
these downsides could be eliminated/mitigated by the compiler. We are proposing
one such mitigation—implicit conversion—as part of the "different type,
shared storage" option, to help avoid the cognitive load on developers of
having to deal with a separate `Substring` type.

To avoid the memory leak issues of a "same type, shared storage" substring
option, we considered whether the compiler could perform an implicit copy of
the underlying storage when it detects the string is being "stored" for long
term usage, say when it is assigned to a stored property. The trouble with this
approach is it is very difficult for the compiler to distinguish between
long-term storage versus short-term in the case of abstractions that rely on
stored properties. For example, should the storing of a substring inside an
`Optional` be considered long-term? Or the storing of multiple substrings
inside an array? The latter would not work well in the case of a
`components(separatedBy:)` implementation that intended to return an array of
substrings. It would also be difficult to distinguish intentional medium-term
storage of substrings, say by a lexer. There does not appear to be an effective
consistent rule that could be applied in the general case for detecting when a
substring is truly being stored long-term.

To avoid the cost of copying substrings under "same type, copied storage", the
optimizer could be enhanced to reduce the impact of some of those copies.
For example, this code could be optimized to pull the invariant substring out
of the loop:

```swift
for _ in 0..<lots {
  someFunc(takingString: bigString[bigRange])
}
```

It's worth noting that a similar optimization is needed to avoid an equivalent
problem with implicit conversion in the "different type, shared storage" case:

```swift
let substring = bigString[bigRange]
for _ in 0..<lots { someFunc(takingString: substring) }
```

However, in the case of "same type, copied storage" there are many use cases
that cannot be optimized as easily. Consider the following simple definition of
a recursive `contains` algorithm, which when substring slicing is linear makes
the overall algorithm quadratic:

```swift
extension String {
    func containsChar(_ x: Character) -> Bool {
        return !isEmpty && (first == x || dropFirst().containsChar(x))
    }
}
```

It is unrealistic to expect the optimizer to eliminate this problem; instead,
the user must remember to rewrite such code to avoid string slicing if they
want it to be efficient:

```swift
extension String {
    // add optional argument tracking progress through the string
    func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
        let idx = idx ?? startIndex
        return idx != endIndex
            && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
    }
}
```

#### Substrings, Ranges and Objective-C Interop

The pattern of passing a string/range pair is common in several Objective-C
APIs, and is made especially awkward in Swift by the non-interchangeability of
`Range<String.Index>` and `NSRange`.

```swift
s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))
```

In general, however, the Swift idiom for operating on a sub-range of a
`Collection` is to *slice* the collection and operate on that:

```swift
s2.find(s2[j..<s2.endIndex])
```

Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported
without the `NSRange` argument. The Objective-C importer should be changed to
give these APIs special treatment so that when a `Substring` is passed, instead
of being converted to a `String`, the full `NSString` and range are passed to
the Objective-C method, thereby avoiding a copy.

As a result, you would never need to pass an `NSRange` to these APIs, which
solves the impedance problem by eliminating the argument, resulting in more
idiomatic Swift code while retaining the performance benefit. To help users
manually handle any cases that remain, Foundation should be augmented to allow
the following syntax for converting to and from `NSRange`:

```swift
let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
let iToJ = Range(nsr, in: s)    // Equivalent to i..<j
```
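
These converting initializers shipped in Swift 4's Foundation overlay, so the
round trip can be exercised directly:

```swift
import Foundation

// NSRange counts UTF-16 code units; the converting inits handle the
// translation to and from Range<String.Index>.
let s = "café au lait"
let i = s.startIndex
let j = s.index(i, offsetBy: 4)   // just past "café"

let nsr = NSRange(i..<j, in: s)
assert(nsr.location == 0 && nsr.length == 4)

let roundTripped = Range(nsr, in: s)   // Range<String.Index>?
assert(roundTripped == i..<j)
```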

### The `Unicode` protocol

With `Substring` and `String` being distinct types and sharing almost all
interface and semantics, and with the highest-performance string processing
requiring knowledge of encoding and layout that the currency types can't
provide, it becomes important to capture the common “string API” in a protocol.
Since Unicode conformance is a key feature of string processing in Swift, we
call that protocol `Unicode`:

**Note:** The following assumes several features that are planned but not yet implemented in
Swift, and should be considered a sketch rather than a final design.

```swift
protocol Unicode
  : Comparable, BidirectionalCollection where Element == Character {

  associatedtype Encoding : UnicodeEncoding
  var encoding: Encoding { get }

  associatedtype CodeUnits
    : RandomAccessCollection where Element == Encoding.CodeUnit
  var codeUnits: CodeUnits { get }

  associatedtype UnicodeScalars
    : BidirectionalCollection where Element == UnicodeScalar
  var unicodeScalars: UnicodeScalars { get }

  associatedtype ExtendedASCII
    : BidirectionalCollection where Element == UInt32
  var extendedASCII: ExtendedASCII { get }
}
```

```swift
extension Unicode {
  // ... define high-level non-mutating string operations, e.g. search ...

  func compared<Other: Unicode>(
    to rhs: Other,
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> SortOrder { ... }
}

extension Unicode : RangeReplaceableCollection
  where CodeUnits : RangeReplaceableCollection {
  // Satisfy protocol requirement
  mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
    where C.Element == Element

  // ... define high-level mutating string operations, e.g. replace ...
}
```

The goal is that `Unicode` exposes the underlying encoding and code units in
such a way that for types with a known representation (e.g. a high-performance
`UTF8String`) that information can be known at compile-time and can be used to
generate a single path, while still allowing types like `String` that admit
multiple representations to use runtime queries and branches to fast path
specializations.

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.

### Scanning, Matching, and Tokenization

#### Low-Level Textual Analysis

We should provide convenient APIs for processing strings by character. For example,
it should be easy to cleanly express, “if this string starts with `"f"`, process
the rest of the string as follows…” Swift is well-suited to expressing this
common pattern beautifully, but we need to add the APIs. Here are two examples
of the sort of code that might be possible given such APIs:

```swift
if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
  somethingWith(input) // process the rest of input
}

if let (number, restOfInput) = input.parsingPrefix(Int.self) {
  ...
}
```

The specific spelling and functionality of APIs like this are TBD. The larger
point is to make sure matching-and-consuming jobs are well-supported.

#### Unified Pattern Matcher Protocol

Many of the current methods that do matching are overloaded to do the same
logical operations in different ways, with the following axes:

- Logical Operation: `find`, `split`, `replace`, match at start
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
the method name, and sometimes an argument
- Whole string or subrange.

We should represent these aspects as orthogonal, composable components,
abstracting pattern matchers into a protocol like
[this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
that can allow us to define logical operations once, without introducing
overloads, and massively reducing API surface area.

For example, using the strawman prefix `%` syntax to turn string literals into
patterns, the following pairs would all invoke the same generic methods:

```swift
if let found = s.firstMatch(%"searchString") { ... }
if let found = s.firstMatch(someRegex) { ... }

for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
for m in s.allMatches(someRegex) { ... }

let items = s.split(separatedBy: ", ")
let tokens = s.split(separatedBy: CharacterSet.whitespace)
```

Note that, because Swift requires the indices of a slice to match the indices of
the range from which it was sliced, operations like `firstMatch` can return a
`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in
the string being searched, if needed, can easily be recovered as the
`startIndex` and `endIndex` of the `Substring`.
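
A sketch of this index-sharing invariant, using Foundation's existing
`range(of:)` in place of the proposed `firstMatch`:

```swift
import Foundation

// A slice shares its indices with the base collection, so a matched
// Substring carries its own location in the original string.
let s = "to be or not to be"

if let r = s.range(of: "not") {
    let match = s[r]                        // Substring
    assert(match == "not")
    // The Range is fully recoverable from the Substring's own indices.
    assert(match.startIndex == r.lowerBound)
    assert(match.endIndex == r.upperBound)
}
```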

Note also that matching operations are useful for collections in general, and
would fall out of this proposal:

```swift
// replace subsequences of contiguous NaNs with zero
forces.replace(oneOrMore([Float.nan]), [0.0])
```

#### Regular Expressions

Addressing regular expressions is out of scope for this proposal.
That said, it is important to note that the pattern matching protocol mentioned
above provides a suitable foundation for regular expressions, and types such as
`NSRegularExpression` can easily be retrofitted to conform to it. In the
future, support for regular expression literals in the compiler could allow for
compile-time syntax checking and optimization.

### String Indices

`String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and
`utf16`—each with its own opaque index type. The APIs used to translate indices
between views add needless complexity, and the opacity of indices makes them
difficult to serialize.

The index translation problem has two aspects:

1. `String` views cannot consume one another's indices without a cumbersome
   conversion step. An index into a `String`'s `characters` must be translated
   before it can be used as a position in its `unicodeScalars`. Although these
   translations are rarely needed, they add conceptual and API complexity.
2. Many APIs in the core libraries and other frameworks still expose `String`
   positions as `Int`s and regions as `NSRange`s, which can only reference a
   `utf16` view and interoperate poorly with `String` itself.

#### Index Interchange Among Views

String's need for flexible backing storage and reasonably-efficient indexing
(i.e. without dynamically allocating and reference-counting the indices
themselves) means indices need an efficient underlying storage type. Although
we do not wish to expose `String`'s indices *as* integers, `Int` offsets into
the underlying code unit storage make a good underlying storage type, provided
`String`'s underlying storage supports random-access. We think random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Making these `Int` code unit offsets conveniently accessible and constructible
solves the serialization problem:

```swift
clipboard.write(s.endIndex.codeUnitOffset)
let offset = clipboard.read(Int.self)
let i = String.Index(codeUnitOffset: offset)
```

Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
seamless by having them share an index type (semantics of indexing a `String`
between grapheme cluster boundaries are TBD—it can either trap or be forgiving).
Having a common index allows easy traversal into the interior of graphemes,
something that is often needed, without making it likely that someone will do it
by accident.

- `String.index(after:)` should advance to the next grapheme, even when the
  index points partway through a grapheme.

- `String.index(before:)` should move to the start of the grapheme before
  the current position.

Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not
crucial, as the specifics of encoding should not be a concern for most use
cases, and would impose needless costs on the indices of other views. That
said, we can make translation much more straightforward by exposing simple
bidirectional converting `init`s on both index types:

```swift
let u8Position = String.UTF8.Index(someStringIndex)
let originalPosition = String.Index(u8Position)
```

#### Index Interchange with Cocoa

We intend to address `NSRange`s that denote substrings in Cocoa APIs as
described [later in this document](#substrings--ranges-and-objective-c-interop).
That leaves the interchange of bare indices with Cocoa APIs trafficking in
`Int`. Hopefully such APIs will be rare, but when needed, the following
extension, which would be useful for all `Collections`, can help:

```swift
extension Collection {
  func index(offset: IndexDistance) -> Index {
    return index(startIndex, offsetBy: offset)
  }
  func offset(of i: Index) -> IndexDistance {
    return distance(from: startIndex, to: i)
  }
}
```

Then integers can easily be translated into offsets into a `String`'s `utf16`
view for consumption by Cocoa:

```swift
let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
let swiftIndex = s.utf16.index(offset: cocoaIndex)
```

### Formatting

A full treatment of formatting is out of scope of this proposal, but
we believe it's crucial for completing the text processing picture. This
section details some of the existing issues and thinking that may guide future
development.

#### Printf-Style Formatting

`String.format` is designed on the `printf` model: it takes a format string with
textual placeholders for substitution, and an arbitrary list of other arguments.
The syntax and meaning of these placeholders has a long history in
C, but for anyone who doesn't use them regularly they are cryptic and complex,
as the `printf (3)` man page attests.

Aside from complexity, this style of API has two major problems: First, the
spelling of these placeholders must match up to the types of the arguments, in
the right order, or the behavior is undefined. Some limited support for
compile-time checking of this correspondence could be implemented, but only for
the cases where the format string is a literal. Second, there's no reasonable
way to extend the formatting vocabulary to cover the needs of new types: you are
stuck with what's in the box.

#### Foundation Formatters

The formatters supplied by Foundation are highly capable and versatile, offering
both formatting and parsing services. When used for formatting, though, the
design pattern demands more from users than it should:

* Matching the type of data being formatted to a formatter type
* Creating an instance of that type
* Setting stateful options (`currency`, `dateStyle`) on the type. Note: the
   need for this step prevents the instance from being used and discarded in
   the same expression where it is created.
* Overall, introduction of needless verbosity into source

These may seem like small issues, but the experience of Apple localization
experts is that the total drag of these factors on programmers is such that they
tend to reach for `String.format` instead.

#### String Interpolation

Swift string interpolation provides a user-friendly alternative to printf's
domain-specific language (just write ordinary swift code!) and its type safety
problems (put the data right where it belongs!) but the following issues prevent
it from being useful for localized formatting (among other jobs):

* [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
   types used in string interpolation.
* [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
   distinguish (fragments of) the base string from the string substitutions.

In the long run, we should improve Swift string interpolation to the point where
it can participate in almost any formatting job. Mostly this centers on fixing
the interpolation protocols per the previous items, and on supporting
localization.
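As a sketch of the direction (using the interpolation redesign Swift 5 later adopted in SE-0228, which addressed SR-1260): a conforming type can receive literal segments and substitutions separately, which is exactly what localized formatting needs. The `Segments` type and its `parts` tagging below are hypothetical illustrations, not a proposed API:

```swift
// A type whose interpolation records literal fragments separately from
// the interpolated substitutions.
struct Segments: ExpressibleByStringInterpolation {
    var parts: [String] = []

    init(stringLiteral value: String) { parts = ["literal(\(value))"] }
    init(stringInterpolation: StringInterpolation) { parts = stringInterpolation.parts }

    struct StringInterpolation: StringInterpolationProtocol {
        var parts: [String] = []
        init(literalCapacity: Int, interpolationCount: Int) {}
        mutating func appendLiteral(_ literal: String) {
            parts.append("literal(\(literal))")
        }
        mutating func appendInterpolation<T>(_ value: T) {
            parts.append("substitution(\(value))")
        }
    }
}

let n = 255
let s: Segments = "Column 1: \(n)!"
// s.parts now distinguishes the base string from the substitutions.
```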

To be able to use formatting effectively inside interpolations, it needs to be
both lightweight (because it all happens in-situ) and discoverable. One
approach would be to standardize on `format` methods, e.g.:

```swift
"Column 1: \(n.format(radix: 16, width: 8)) *** \(message)"

"Something with leading zeroes: \(x.format(fill: zero, width: 8))"
```
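A minimal sketch of what such a `format` method could look like for integers; the parameter names `radix`, `width`, and `fill` come from the examples above and are not an existing standard-library API:

```swift
extension FixedWidthInteger {
    // Hypothetical sketch of the proposed lightweight `format` method.
    func format(radix: Int = 10, width: Int = 0, fill: Character = " ") -> String {
        let digits = String(self, radix: radix)
        let padding = max(0, width - digits.count)
        return String(repeating: String(fill), count: padding) + digits
    }
}

// Usable directly inside an interpolation:
// "Column 1: \(255.format(radix: 16, width: 8, fill: "0"))"
```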

### C String Interop

Our support for interoperation with nul-terminated C strings is scattered and
incoherent, with six ways to transform a C string into a `String` and four ways
to do the inverse. These APIs should be replaced with the following:

```swift
extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  ///   bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  ///   the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}
```

In both of the construction APIs, any invalid encoding sequence detected will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per the Unicode specification. This covers the common case. The
replacement is done *physically* in the underlying storage, and the validity of
the result is recorded in the `String`'s `encoding`, so that future accesses
need not be slowed down by the possibility of separate error repair.

Construction that is aborted when encoding errors are detected can be
accomplished using APIs on the `encoding`. String types that retain their
physical encoding even in the presence of errors and are repaired on-the-fly can
be built as different instances of the `Unicode` protocol.
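Today's standard library already exhibits this repairing behavior when decoding; a small illustration using the existing `String(decoding:as:)` (not the proposed initializers above):

```swift
// Invalid encoding sequences are replaced by U+FFFD during decoding.
let good: [UInt8] = [0x48, 0x69]   // "Hi"
let bad:  [UInt8] = [0x48, 0xC3]   // 'H' followed by a truncated 2-byte sequence
let repaired = String(decoding: bad, as: UTF8.self)   // "H\u{FFFD}"
```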

### Unicode 9 Conformance

Unicode 9 (and macOS 10.11) brought us support for family emoji, which changes
the process of properly identifying `Character` boundaries. We need to update
`String` to account for this change.
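Concretely, with Unicode-9-aware grapheme breaking the family emoji is a single `Character`, even though it is composed of several unicode scalars joined by ZERO WIDTH JOINERs:

```swift
let family = "👨‍👨‍👧‍👦"   // four person scalars joined by three ZWJs
// One extended grapheme cluster...
let characterCount = family.count            // 1
// ...built from seven unicode scalars.
let scalarCount = family.unicodeScalars.count // 7
```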

### High-Performance String Processing

Many strings are short enough to store in 64 bits, many can be stored using only
8 bits per unicode scalar, others are best encoded in UTF-16, and some come to
us already in some other encoding, such as UTF-8, that would be costly to
translate. Supporting these formats while maintaining usability for
general-purpose APIs demands that a single `String` type can be backed by many
different representations.

That said, the highest performance code always requires static knowledge of the
data structures on which it operates, and for this code, dynamic selection of
representation comes at too high a cost. Heavy-duty text processing demands a
way to opt out of dynamism and directly use known encodings. Having this
ability can also make it easy to cleanly specialize code that handles dynamic
cases for maximal efficiency on the most common representations.

To address this need, we can build models of the `Unicode` protocol that encode
representation information into the type, such as `NFCNormalizedUTF16String`.

### Parsing ASCII Structure

Although many machine-readable formats support the inclusion of arbitrary
Unicode text, it is also common that their fundamental structure lies entirely
within the ASCII subset (JSON, YAML, many XML formats). These formats are often
processed most efficiently by recognizing ASCII structural elements as ASCII,
and capturing the arbitrary sections between them in more-general strings. The
current String API offers no way to efficiently recognize ASCII and skip past
everything else without the overhead of full decoding into unicode scalars.

For these purposes, strings should supply an `extendedASCII` view that is a
collection of `UInt32`, where values less than `0x80` represent the
corresponding ASCII character, and other values represent data that is specific
to the underlying encoding of the string.
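A stand-in for the proposed `extendedASCII` view can be sketched over today's `utf8` view: because UTF-8 continuation bytes are always ≥ 0x80, ASCII structural bytes can be recognized without decoding anything else. (This only models the UTF-8-backed case; `extendedASCII` itself is hypothetical.)

```swift
// Recognize ASCII structural elements of a JSON-ish string without
// decoding the non-ASCII content between them.
let json = "{\"k\": \"héllo\"}"
let structural: Set<UInt8> = [0x7B, 0x7D, 0x5B, 0x5D, 0x3A, 0x2C, 0x22] // { } [ ] : , "
let found = json.utf8.filter { structural.contains($0) }
```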

## Language Support

This proposal depends on two new features in the Swift language:

1. **Generic subscripts**, to
  enable unified slicing syntax.

2. **A subtype relationship** between
  `Substring` and `String`, enabling framework APIs to traffic solely in
  `String` while still making it possible to avoid copies by handling
  `Substring`s where necessary.

Additionally, **the ability to nest types and protocols inside
protocols** could significantly shrink the footprint of this proposal
on the top-level Swift namespace.

## Open Questions

### Must `String` be limited to storing UTF-16 subset encodings?

- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
question here; this is about what encodings must be storable, without
transcoding, in the common currency type called “`String`”.
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
- If we have a way to get at a `String`'s code units, we need a concrete type in
which to express them in the API of `String`, which is a concrete type
- If String needs to be able to represent UTF-32, presumably the code units need
to be `UInt32`.
- Not supporting UTF-32-encoded text seems like one reasonable design choice.
- Maybe we can allow UTF-8 storage in `String` and expose its code units as
`UInt16`, just as we would for Latin-1.
- Supporting only UTF-16-subset encodings would imply that `String` indices can
be serialized without recording the `String`'s underlying encoding.

### Do we need a type-erasable base protocol for UnicodeEncoding?

UnicodeEncoding has an associated type, but it may be important to be able to
traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”

### Should there be a string “facade?”

One possible design alternative makes `Unicode` a vehicle for expressing
the storage and encoding of code units, but does not attempt to give it an API
appropriate for `String`. Instead, string APIs would be provided by a generic
wrapper around an instance of `Unicode`:

```swift
struct StringFacade<U: Unicode> : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias String = StringFacade<StringStorage>
typealias Substring = StringFacade<StringStorage.SubSequence>
```

This design would allow us to de-emphasize lower-level `String` APIs such as
access to the specific encoding, by putting them behind a `.unicode` property.
A similar effect in a facade-less design would require a new top-level
`StringProtocol` playing the role of the facade with an `associatedtype
Storage : Unicode`.

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

```swift
struct String<U: Unicode = StringStorage>
  : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>
```

One advantage of such a design is that naïve users will always extend “the right
type” (`String`) without thinking, and the new APIs will show up on `Substring`,
`MyUTF8String`, etc. That said, it also has downsides that should not be
overlooked, not least of which is the confusability of the meaning of the word
“string.” Is it referring to the generic or the concrete type?

### `TextOutputStream` and `TextOutputStreamable`

`TextOutputStreamable` is intended to provide a vehicle for
efficiently transporting formatted representations to an output stream
without forcing the allocation of storage. Its use of `String`, a
type with multiple representations, at the lowest-level unit of
communication, conflicts with this goal. It might be sufficient to
change `TextOutputStream` and `TextOutputStreamable` to traffic in an
associated type conforming to `Unicode`, but that is not yet clear.
This area will require some design work.

### `description` and `debugDescription`

* Should these be creating localized or non-localized representations?
* Is returning a `String` efficient enough?
* Is `debugDescription` pulling the weight of the API surface area it adds?

### `StaticString`

`StaticString` was added as a byproduct of standard library development and kept
around because it seemed useful, but it was never truly *designed* for client
programmers. We need to decide what happens with it. Presumably *something*
should fill its role, and that should conform to `Unicode`.
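For reference, the API `StaticString` offers today is minimal, which is part of why it feels undesigned (a small sketch; the byte count below assumes ASCII contents):

```swift
// StaticString: a string known at compile time, with a deliberately tiny API.
let message: StaticString = "out of range"
let ascii = message.isASCII            // true for this literal
let byteCount = message.utf8CodeUnitCount
```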

## Footnotes

<b id="f0">0</b> The integers rewrite currently underway is expected to
   substantially reduce the scope of `Int`'s API by using more
   generics. [:leftwards_arrow_with_hook:](#a0)

<b id="f1">1</b> In practice, these semantics will usually be tied to the
version of the installed [ICU](http://icu-project.org) library, which
programmatically encodes the most complex rules of the Unicode Standard and its
de-facto extension, CLDR.[:leftwards_arrow_with_hook:](#a1)

<b id="f2">2</b>
See
[http://unicode.org/reports/tr29/#Notation](http://unicode.org/reports/tr29/#Notation). Note
that inserting Unicode scalar values to prevent merging of grapheme clusters would
also constitute a kind of misbehavior (one of the clusters at the boundary would
not be found in the result), so would be relatively costly to implement, with
little benefit. [:leftwards_arrow_with_hook:](#a2)

<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by
the Unicode standard for this purpose. In fact there's
a [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
dedicated to it. In particular, §5.17 says:

When comparing text that is visible to end users, a correct linguistic sort
should be used, as described in _Section 5.16, Sorting and
Searching_. However, in many circumstances the only requirement is for a
fast, well-defined ordering. In such cases, a binary ordering can be used.

[:leftwards_arrow_with_hook:](#a4)

<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto
properties in a table that's indexed by unicode scalar value. This table is
part of the Unicode standard. Some of these queries (e.g., “is this an
uppercase character?”) may have fairly obvious generalizations to grapheme
clusters, but exactly how to do it is a research topic and *ideally* we'd either
establish the existing practice that the Unicode committee would standardize, or
the Unicode committee would do the research and we'd implement their
result.[:leftwards_arrow_with_hook:](#a5)

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Joshua Alvarado) #17

The strings proposal is a warm welcome to the Swift language, and I believe
many developers are happy to see strings become a priority. String
processing may be one of the most common day-to-day tasks for a developer.
That being said, one thing in the proposal I believe is not correct is the
omission of regular expressions.

"Addressing regular expressions is out of scope for this proposal."

Working with regular expressions in both Objective-C and Swift is a real
pain. I don't believe the mere existence of NSRegularExpression is a good
enough reason to leave regexes out of the Swift 4 string improvements.
NSRegularExpression is NOT easily retrofitted to strings. Perl, Ruby,
JavaScript, and many more programming languages have native, easy-to-use
regular expression functionality built in.

Examples:
Ruby:
```ruby
/hey/ =~ "hey what's up"
/world/.match('hello world')
```

Javascript:
```javascript
'javascript regex'.search(/regex/)
'hello replace'.replace(/replace/, 'world')
```

Perl
```perl
$statement = "The quick brown fox";

if ($statement =~ /quick/) {
    print "this is what's up\n";
}
```

Now let's look at NSRegularExpression...

Swift:

```swift
do {
    let pattern = "\\w" // escape everything
    let regex = try NSRegularExpression(pattern: pattern, options: [])
    let results = regex.matches(
        in: str, options: .reportCompletion,
        range: NSRange(location: 0,
                       length: str.characters.distance(from: str.startIndex,
                                                       to: str.endIndex)))
    results.forEach {
        print($0) // why is this an NSTextCheckingResult?!
    }
} catch {
    // welp, out of luck
}
```

Yes I'm fully aware of the method:

```swift
str.replacingOccurrences(of: "pattern", with: "something",
                         options: .regularExpression, range: nil)
```

but it is just not enough for what is needed. Also, it is confusing to have
a replace regex method separate from NSRegularExpression. It was not easy
to find.

Taken from [NSHipster](http://nshipster.com/nsregularexpression/):

> Happily, on one thing we can all agree. In NSRegularExpression, Cocoa has
> the most long-winded and byzantine regular expression interface you’re ever
> likely to come across.

There is no way to achieve the goal of being better at string processing
than Perl without regular expressions being addressed. It just should not
be ignored.

···


On Thu, Jan 19, 2017 at 7:56 PM, Ben Cohen via swift-evolution < swift-evolution@swift.org> wrote:

Hi all,

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

We’re eager to hear everyone’s thoughts.

Regards,
Ben and Dave

# String Processing For Swift 4

* Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](
https://github.com/airspeedswift)

The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus
far, with just this short blurb in the
[list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):

> **String re-evaluation**: String is one of the most important fundamental
> types in the language. The standard library leads have numerous ideas of how
> to improve the programming model for it, without jeopardizing the goals of
> providing a unicode-correct-by-default model. Our goal is to be better at
> string processing than Perl!

For Swift 4 and beyond we want to improve three dimensions of text
processing:

  1. Ergonomics
  2. Correctness
  3. Performance

This document is meant to both provide a sense of the long-term vision
(including undecided issues and possible approaches), and to define the
scope of
work that could be done in the Swift 4 timeframe.

## General Principles

### Ergonomics

It's worth noting that ergonomics and correctness are mutually-reinforcing. An
API that is easy to use—but incorrectly—cannot be considered an ergonomic
success. Conversely, an API that's simply hard to use is also hard to use
correctly. Achieving optimal performance without compromising ergonomics or
correctness is a greater challenge.

Consistency with the Swift language and idioms is also important for
ergonomics. There are several places both in the standard library and in
the
foundation additions to `String` where patterns and practices found
elsewhere
could be applied to improve usability and familiarity.

### API Surface Area

Primary data types such as `String` should have APIs that are easily
understood
given a signature and a one-line summary. Today, `String` fails that
test. As
you can see, the Standard Library and Foundation both contribute
significantly to
its overall complexity.

**Method Arity** | **Standard Library** | **Foundation**
---|:---:|:---:
0: `ƒ()` | 5 | 7
1: `ƒ(:)` | 19 | 48
2: `ƒ(::)` | 13 | 19
3: `ƒ(:::)` | 5 | 11
4: `ƒ(::::)` | 1 | 7
5: `ƒ(:::::)` | - | 2
6: `ƒ(::::::)` | - | 1

**API Kind** | **Standard Library** | **Foundation**
---|:---:|:---:
`init` | 41 | 18
`func` | 42 | 55
`subscript` | 9 | 0
`var` | 26 | 14

**Total: 205 APIs**

By contrast, `Int` has 80 APIs, none with more than two parameters.[0]
String processing is complex enough; users shouldn't have
to press through physical API sprawl just to get started.

Many of the choices detailed below contribute to solving this problem,
including:

  * Restoring `Collection` conformance and dropping the `.characters` view.
  * Providing a more general, composable slicing syntax.
  * Altering `Comparable` so that parameterized
    (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
  * Clearly separating language-dependent operations on text produced
    by and for humans from language-independent
    operations on text produced by and for machine processing.
  * Relocating APIs that fall outside the domain of basic string
processing and
    discouraging the proliferation of ad-hoc extensions.

### Batteries Included

While `String` is available to all programs out-of-the-box, crucial APIs
for
basic string processing tasks are still inaccessible until `Foundation` is
imported. While it makes sense that `Foundation` is needed for
domain-specific
jobs such as
[linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
one should not need to import anything to, for example, do case-insensitive
comparison.

### Unicode Compliance and Platform Support

The Unicode standard provides a crucial objective reference point for what
constitutes correct behavior in an extremely complex domain, so
Unicode-correctness is, and will remain, a fundamental design principle
behind
Swift's `String`. That said, the Unicode standard is an evolving
document, so
this objective reference-point is not fixed.[1] While
many of the most important operations—e.g. string hashing, equality, and
non-localized comparison—will be stable, the semantics
of others, such as grapheme breaking and localized comparison and case
conversion, are expected to change as platforms are updated, so programs
should
be written so their correctness does not depend on precise stability of
these
semantics across OS versions or platforms. Although it may be possible to
imagine static and/or dynamic analysis tools that will help users find such
errors, the only sure way to deal with this fact of life is to educate
users.

## Design Points

### Internationalization

There is strong evidence that developers cannot determine how to use
internationalization APIs correctly. Although documentation could and
should be
improved, the sheer size, complexity, and diversity of these APIs is a
major
contributor to the problem, causing novices to tune out, and more
experienced
programmers to make avoidable mistakes.

The first step in improving this situation is to regularize all localized
operations as invocations of normal string operations with extra
parameters. Among other things, this means:

1. Doing away with `localizedXXX` methods
2. Providing a terse way to name the current locale as a parameter
3. Automatically adjusting defaults for options such
   as case sensitivity based on whether the operation is localized.
4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
    guidance in the
    [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).

Along with appropriate documentation updates, these changes will make
localized
operations more teachable, comprehensible, and approachable, thereby
lowering a
barrier that currently leads some developers to ignore localization issues
altogether.

#### The Default Behavior of `String`

Although this isn't well-known, the most accessible form of many
operations on
Swift `String` (and `NSString`) are really only appropriate for text that
is
intended to be processed for, and consumed by, machines. The semantics of
the
operations with the simplest spellings are always non-localized and
language-agnostic.

Two major factors play into this design choice:

1. Machine processing of text is important, so we should have first-class,
   accessible functions appropriate to that use case.

2. The most general localized operations require a locale parameter not
required
   by their un-localized counterparts. This naturally skews complexity
towards
   localized operations.

Reaffirming that `String`'s simplest APIs have
language-independent/machine-processed semantics has the benefit of
clarifying
the proper default behavior of operations such as comparison, and allows
us to
make [significant optimizations](#collation-semantics) that were
previously
thought to conflict with Unicode.

#### Future Directions

One of the most common internationalization errors is the unintentional
presentation to users of text that has not been localized, but
regularizing APIs
and improving documentation can go only so far in preventing this error.
Combined with the fact that `String` operations are non-localized by
default,
the environment for processing human-readable text may still be somewhat
error-prone in Swift 4.

For an audience of mostly non-experts, it is especially important that
naïve
code is very likely to be correct if it compiles, and that more
sophisticated
issues can be revealed progressively. For this reason, we intend to
specifically and separately target localization and internationalization
problems in the Swift 5 timeframe.

### Operations With Options

There are three categories of common string operation that commonly need
to be
tuned in various dimensions:

**Operation**|**Applicable Options**
---|---
sort ordering | locale, case/diacritic/width-insensitivity
case conversion | locale
pattern matching | locale, case/diacritic/width-insensitivity

The defaults for case-, diacritic-, and width-insensitivity are different
for
localized operations than for non-localized operations, so for example a
localized sort should be case-insensitive by default, and a non-localized
sort
should be case-sensitive by default. We propose a standard “language” of
defaulted parameters to be used for these purposes, with usage roughly
like this:

```swift
x.compared(to: y, case: .sensitive, in: swissGerman)

x.lowercased(in: .currentLocale)

x.allMatches(
  somePattern, case: .insensitive, diacritic: .insensitive)
```

This usage might be supported by code like this:

```swift
enum StringSensitivity {
  case sensitive
  case insensitive
}

extension Locale {
  static var currentLocale: Locale { ... }
}

extension Unicode {
  // An example of the option language in declaration context,
  // with nil defaults indicating unspecified, so defaults can be
  // driven by the presence/absence of a specific Locale
  func frobnicated(
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> Self { ... }
}
```

### Comparing and Hashing Strings

#### Collation Semantics

What Unicode says about collation—which is used in `<`, `==`, and hashing—
turns
out to be quite interesting, once you pick it apart. The full Unicode
Collation
Algorithm (UCA) works like this:

1. Fully normalize both strings
2. Convert each string to a sequence of numeric triples to form a
collation key
3. “Flatten” the key by concatenating the sequence of first elements to the
   sequence of second elements to the sequence of third elements
4. Lexicographically compare the flattened keys

While step 1 can usually
be [done quickly](http://unicode.org/reports/tr15/#Description_Norm) and
incrementally, step 2 uses a collation table that maps matching
*sequences* of
unicode scalars in the normalized string to *sequences* of triples, which
get
accumulated into a collation key. Predictably, this is where the real
costs
lie.

*However*, there are some bright spots to this story. First, as it turns
out,
string sorting (localized or not) should be done down to what's called
the
[“identical” level](http://unicode.org/reports/tr10/#Multi_Level_Comparison),
which adds a step 3a: append the string's normalized form to the flattened
collation key. At first blush this just adds work, but consider what it
does
for equality: two strings that normalize the same, naturally, will collate
the
same. But also, *strings that normalize differently will always collate
differently*. In other words, for equality, it is sufficient to compare
the
strings' normalized forms and see if they are the same. We can therefore
entirely skip the expensive part of collation for equality comparison.
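Swift's `String` equality already behaves this way: canonically equivalent strings (those with the same normalized form) compare equal, without any collation keys. A small illustration:

```swift
let decomposed = "e\u{0301}"   // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
let precomposed = "\u{00E9}"   // LATIN SMALL LETTER E WITH ACUTE
// Equality compares normalized forms, so these are the same string...
let equal = decomposed == precomposed
// ...and each is a single grapheme cluster.
let graphemes = decomposed.count
```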

Next, naturally, anything that applies to equality also applies to
hashing: it
is sufficient to hash the string's normalized form, bypassing collation
keys.
This should provide significant speedups over the current implementation.
Perhaps more importantly, since comparison down to the “identical” level
applies
even to localized strings, it means that hashing and equality can be
implemented
exactly the same way for localized and non-localized text, and hash tables
with
localized keys will remain valid across current-locale changes.

Finally, once it is agreed that the *default* role for `String` is to
handle
machine-generated and machine-readable text, the default ordering of
`String`s
need no longer use the UCA at all. It is sufficient to order them in any
way
that's consistent with equality, so `String` ordering can simply be a
lexicographical comparison of normalized forms,[4]
(which is equivalent to lexicographically comparing the sequences of
grapheme
clusters), again bypassing step 2 and offering another speedup.

This leaves us executing the full UCA *only* for localized sorting, and
ICU's
implementation has apparently been very well optimized.

Following this scheme everywhere would also allow us to make sorting
behavior
consistent across platforms. Currently, we sort `String` according to the
UCA,
except that—*only on Apple platforms*—pairs of ASCII characters are
ordered by
unicode scalar value.

#### Syntax

Because the current `Comparable` protocol expresses all comparisons with
binary
operators, string comparisons—which may require
additional [options](#operations-with-options)—do not fit smoothly into
the
existing syntax. At the same time, we'd like to solve other problems with
comparison, as outlined
in
[this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
(implemented by changes at the head of
[this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
We should adopt a modification of that proposal that uses a method rather
than
an operator `<=>`:

```swift
enum SortOrder { case before, same, after }

protocol Comparable : Equatable {
  func compared(to: Self) -> SortOrder
  ...
}
```

This change will give us a syntactic platform on which to implement
methods with
additional, defaulted arguments, thereby unifying and regularizing
comparison
across the library.

```swift
extension String {
  func compared(to: Self) -> SortOrder
}
```

**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also
possible
that the standard library simply adopts Foundation's `ComparisonResult` as
is,
but we believe the community should at least consider alternate naming
before
that happens. There will be an opportunity to discuss the choices in
detail
when the modified
[Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
comes up for review.
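A minimal sketch of how `compared(to:)` could be layered on today's `Comparable` operators (the `SortOrder` enum is the one proposed above; the default implementation shown is an illustration, not the proposal's text):

```swift
enum SortOrder { case before, same, after }

extension Comparable {
    // Derive the three-way result from the existing binary operators.
    func compared(to other: Self) -> SortOrder {
        if self < other { return .before }
        if self == other { return .same }
        return .after
    }
}
```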

### `String` should be a `Collection` of `Character`s Again

In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
convinced ourselves that its semantics differed from those of `Collection`
too
significantly.

It was always well understood that if strings were treated as sequences of
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`,
`elementsEqual`,
and `reversed` would produce nonsense results. Thus, in Swift 1.0,
`String` was
a collection of `Character` (extended grapheme clusters). During 2.0
development, though, we realized that correct string concatenation could
occasionally merge distinct grapheme clusters at the start and end of
combined
strings.

This quirk aside, every aspect of strings-as-collections-of-graphemes
appears to
comport perfectly with Unicode. We think the concatenation problem is
tolerable,
because the cases where it occurs all represent partially-formed
constructs. The
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING
ACUTE
ACCENT)—are explicitly called out in the Unicode standard as
“[degenerate](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)” or
“[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The other
other
cases—such as a string ending in a zero-width joiner or half of a regional
indicator—appear to be equally transient and unlikely outside of a text
editor.

Admitting these cases encourages exploration of grapheme composition and is
consistent with what appears to be an overall Unicode philosophy that “no
special provisions are made to get marginally better behavior for… cases
that
never occur in practice.”[2] Furthermore, it seems
unlikely to disturb the semantics of any plausible algorithms. We can
handle
these cases by documenting them, explicitly stating that the elements of a
`String` are an emergent property based on Unicode rules.

The benefits of restoring `Collection` conformance are substantial:

  * Collection-like operations encourage experimentation with strings to
    investigate and understand their behavior. This is useful for teaching
new
    programmers, but also good for experienced programmers who want to
    understand more about strings/unicode.

  * Extended grapheme clusters form a natural element boundary for Unicode
    strings. For example, searching and matching operations will always
produce
    results that line up on grapheme cluster boundaries.

  * Character-by-character processing is a legitimate thing to do in many
real
    use-cases, including parsing, pattern matching, and language-specific
    transformations such as transliteration.

  * `Collection` conformance makes a wide variety of powerful operations
    available that are appropriate to `String`'s default role as the
vehicle for
    machine processed text.

    The methods `String` would inherit from `Collection`, where they parallel
    higher-level string algorithms, have the right semantics. For example,
    grapheme-wise `lexicographicalCompare`, `elementsEqual`, and
application of
    `flatMap` with case-conversion, produce the same results one would
expect
    from whole-string ordering comparison, equality comparison, and
    case-conversion, respectively. `reverse` operates correctly on
graphemes,
    keeping diacritics moored to their base characters and leaving emoji
intact.
    Other methods such as `indexOf` and `contains` make obvious sense. A
few
    `Collection` methods, like `min` and `max`, may not be particularly
useful
    on `String`, but we don't consider that to be a problem worth solving,
in
    the same way that we wouldn't try to suppress `min` and `max` on a
    `Set([UInt8])` that was used to store IP addresses.

  * Many of the higher-level operations that we want to provide for
    `String`s, such as parsing and pattern matching, should apply to any
    `Collection`, and many of the benefits we want for `Collection`s, such as
    unified slicing, should accrue equally to `String`. Making `String` part
    of the same protocol hierarchy allows us to write these operations once
    and not worry about keeping the benefits in sync.

  * Slicing strings into substrings is a crucial part of the vocabulary of
    string processing, and all other sliceable things are `Collection`s.
    Because of its collection-like behavior, users naturally think of
    `String` in collection terms, but run into frustrating limitations where
    it fails to conform, and are left to wonder where all the differences
    lie. Many simply “correct” this limitation by declaring a trivial
    conformance:

    ```swift
    extension String : BidirectionalCollection {}
    ```

    Even if we removed indexing-by-element from `String`, users could still
    do this:

    ```swift
      extension String : BidirectionalCollection {
        subscript(i: Index) -> Character { return characters[i] }
      }
    ```

    It would be much better to legitimize the conformance to `Collection` and
    simply document the oddity of any concatenation corner-cases, than to
    deny users the benefits on the grounds that a few cases are confusing.
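
As a concrete illustration of these grapheme-cluster semantics, here is a
small sketch written against the proposed direct `Collection` conformance
(today the same behavior requires going through the `characters` view):

```swift
let cafe = "cafe\u{301}"                 // "café" spelled with a combining acute accent
// `reversed` keeps the diacritic moored to its base character:
let backwards = String(cafe.reversed())  // "éfac", not an orphaned accent plus "efac"
// Equality is canonical: the combining spelling equals the precomposed one.
let precomposed = "caf\u{E9}"
let equal = (cafe == precomposed)        // true
```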

Note that the fact that `String` is a collection of graphemes does *not* mean
that string operations will necessarily have to do grapheme boundary
recognition. See the Unicode protocol section for details.

### `Character` and `CharacterSet`

`Character`, which represents a Unicode
[extended grapheme cluster](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries),
is a bit of a black box, requiring conversion to `String` in order to do any
introspection, including interoperation with ASCII. To fix this, we should:

- Add a `unicodeScalars` view much like `String`'s, so that the
  sub-structure of grapheme clusters is discoverable.
- Add a failable `init` from sequences of scalars (returning nil for
  sequences that contain 0 or 2+ graphemes).
- (Lower priority) expose some operations, such as `func uppercase() ->
  String`, `var isASCII: Bool`, and, to the extent they can be sensibly
  generalized, queries of Unicode properties that should also be exposed on
  `UnicodeScalar`, such as `isAlphabetic` and `isGraphemeBase`.
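
A sketch of how these additions might look in use (hypothetical API: neither
the `unicodeScalars` view on `Character` nor the failable initializer exists
yet, and the names are illustrative):

```swift
let ch: Character = "e\u{301}"            // "é" as base letter + combining accent
// Proposed: inspect sub-structure without converting to String.
let scalarValues = ch.unicodeScalars.map { $0.value }                  // [0x65, 0x301]
// Proposed: failable construction; nil unless exactly one grapheme results.
let rebuilt = Character([UnicodeScalar(0x65)!, UnicodeScalar(0x301)!]) // "é"
let invalid = Character([UnicodeScalar(0x65)!, UnicodeScalar(0x66)!])  // nil: "ef" is two graphemes
```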

Despite its name, `CharacterSet` currently operates on the Swift
`UnicodeScalar` type. This means it is usable on `String`, but only by going
through the unicode scalar view. To deal with this clash in the short term,
`CharacterSet` should be renamed to `UnicodeScalarSet`. In the longer term,
it may be appropriate to introduce a `CharacterSet` that provides similar
functionality for extended grapheme clusters.[5]

### Unification of Slicing Operations

Creating substrings is a basic part of String processing, but the slicing
operations that we have in Swift are inconsistent in both their spelling and
their naming:

  * Slices with two explicit endpoints are done with subscript, and support
    in-place mutation:

    ```swift
        s[i..<j].mutate()
    ```

  * Slicing from an index to the end, or from the start to an index, is done
    with a method and does not support in-place mutation:

    ```swift
        s.prefix(upTo: i).readOnly()
    ```

Prefix and suffix operations should be migrated to be subscripting operations
with one-sided ranges, i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as in
[this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
With generic subscripting in the language, that will allow us to collapse a
wide variety of methods and subscript overloads into a single implementation,
and give users an easy-to-use and composable way to describe subranges.

Further extending this EDSL to integrate use-cases like
`s.prefix(maxLength: 5)` is an ongoing research project that can be
considered part of the potential long-term vision of text (and collection)
processing.
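
Assuming one-sided ranges land as proposed, the old and new spellings would
correspond like this (a sketch):

```swift
let s = "conjunction"
let i = s.index(s.startIndex, offsetBy: 3)
let head = s[..<i]   // "con"      — today spelled s.prefix(upTo: i)
let tail = s[i...]   // "junction" — today spelled s.suffix(from: i)
```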

### Substrings

When implementing substring slicing, languages are faced with three options:

1. Make the substrings the same type as string, and share storage.
2. Make the substrings the same type as string, and copy storage when making
   the substring.
3. Make substrings a different type, with a storage copy on conversion to
   string.

We think number 3 is the best choice. A walk-through of the tradeoffs
follows.

#### Same type, shared storage

In Swift 3.0, slicing a `String` produces a new `String` that is a view into
a subrange of the original `String`'s storage. This is why `String` is three
words in size (the start, length, and buffer owner), unlike the similar
`Array` type, which is only one.

This is a simple model with big efficiency gains when chopping up strings
into multiple smaller strings. But it does mean that a stored substring keeps
the entire original string buffer alive even after it would normally have
been released.

This arrangement has proven to be problematic in other programming languages,
because applications sometimes extract small strings from large ones and keep
those small strings long-term. That is considered a memory leak and was
enough of a problem in Java that they changed from substrings sharing storage
to making a copy in 1.7.

#### Same type, copied storage

Copying of substrings is also the choice made in C#, and in the default
`NSString` implementation. This approach avoids the memory leak issue, but
has obvious performance overhead in performing the copies.

This in turn encourages trafficking in string/range pairs instead of in
substrings, for performance reasons, leading to API challenges. For example:

```swift
foo.compare(bar, range: start..<end)
```

Here, it is not clear whether `range` applies to `foo` or `bar`. This
relationship is better expressed in Swift as a slicing operation:

```swift
foo[start..<end].compare(bar)
```

Not only does this clarify to which string the range applies, it also brings
this sub-range capability to any API that operates on `String` “for free”.
So these other combinations also work equally well:

```swift
// apply range on argument rather than target
foo.compare(bar[start..<end])
// apply range on both
foo[start..<end].compare(bar[start1..<end1])
// compare two strings ignoring first character
foo.dropFirst().compare(bar.dropFirst())
```

In all three cases, an explicit range argument need not appear on the
`compare` method itself. The implementation of `compare` does not need to
know anything about ranges. Methods need only take range arguments when
ranges are an integral part of their purpose (for example, setting the start
and end of a user's current selection in a text box).

#### Different type, shared storage

The desire to share underlying storage while preventing accidental memory
leaks also occurs with slices of `Array`. For this reason we have an
`ArraySlice` type. The inconvenience of a separate type is mitigated by the
fact that most operations used on `Array` from the standard library are
generic over `Sequence` or `Collection`.

We should apply the same approach to `String` by introducing a distinct
`SubSequence` type, `Substring`. Advice similar to that given for
`ArraySlice` would apply to `Substring`:

> Important: Long-term storage of `Substring` instances is discouraged. A
> substring holds a reference to the entire storage of a larger string, not
> just to the portion it presents, even after the original string's lifetime
> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
> of large strings that are no longer otherwise accessible, which can appear
> to be memory leakage.

When assigning a `Substring` to a longer-lived variable (usually a stored
property) explicitly of type `String`, a type conversion will be performed,
and at this point the substring buffer is copied and the original string's
storage can be released.
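
A sketch of the intended usage pattern under this model, using an explicit
`String.init` where the proposed subtype relationship would make the
conversion implicit:

```swift
let bigString = String(repeating: "x", count: 10_000) + ",payload"
let piece = bigString.split(separator: ",").last!  // a Substring sharing bigString's storage
struct Record { var tag: String }                  // long-term storage should use String
let record = Record(tag: String(piece))            // conversion copies just "payload";
                                                   // bigString's buffer can then be released
```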

A `String` that was not its own `Substring` could be one word—a single tagged
pointer—without requiring additional allocations. `Substring`s would be a
view onto a `String`, so they are three words: a pointer to the owner, a
pointer to the start, and a length. The small string optimization for
`Substring` would take advantage of the larger size, probably with a less
compressed encoding for speed.

The downside of having two types is the inconvenience of sometimes having a
`Substring` when you need a `String`, and vice-versa. This is likely to be a
significantly bigger problem than with `Array` and `ArraySlice`, as slicing
of `String` is such a common operation. It is especially relevant to existing
code that assumes `String` is the currency type. To ease the pain of type
mismatches, `Substring` should be a subtype of `String` in the same way that
`Int` is a subtype of `Optional<Int>`. This would give users an implicit
conversion from `Substring` to `String`, as well as the usual implicit
conversions such as `[Substring]` to `[String]` that other subtype
relationships receive.

In most cases, type inference combined with the subtype relationship should
make the type difference a non-issue, and users will not care which type they
are using. For flexibility and optimizability, most operations from the
standard library will traffic in generic models of
[`Unicode`](#the--code-unicode--code--protocol).

##### Guidance for API Designers

In this model, **if a user is unsure about which type to use, `String` is
always a reasonable default**. A `Substring` passed where `String` is
expected will be implicitly copied. When compared to the “same type, copied
storage” model, we have effectively deferred the cost of copying from the
point where a substring is created until it must be converted to `String`
for use with an API.

A user who needs to optimize away copies altogether should use this
guideline: if for performance reasons you are tempted to add a `Range`
argument to your method as well as a `String`, to avoid unnecessary copies,
you should instead use `Substring`.
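
For example, instead of a `String` parameter paired with a `Range`, the
guideline suggests a single `Substring` parameter and letting callers slice
(`leadingDigits` is an illustrative name, not a proposed API):

```swift
// Operates on the shared buffer; callers choose the range by slicing.
func leadingDigits(_ s: Substring) -> Substring {
  return s.prefix(while: { "0"..."9" ~= $0 })
}

let line = "123 apples"
let digits = leadingDigits(line[line.startIndex..<line.endIndex])  // "123"
```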

##### The “Empty Subscript”

To make it easy to call such an optimized API when you only have a `String`
(or to call any API that takes a `Collection`'s `SubSequence` when all you
have is the `Collection`), we propose the following “empty subscript”
operation,

```swift
extension Collection {
  subscript() -> SubSequence {
    return self[startIndex..<endIndex]
  }
}
```

which allows the following usage:

```swift
funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring
```

The `[]` syntax can be offered as a fixit when needed, similar to `&` for an
`inout` argument. While it doesn't help a user to convert `[String]` to
`[Substring]`, the need for such conversions is extremely rare and can be
handled with a simple `map` (which could also be offered by a fixit):

```swift
takesAnArrayOfSubstring(arrayOfString.map { $0[] })
```

#### Other Options Considered

As we have seen, all three options above have downsides, but it's possible
these downsides could be eliminated or mitigated by the compiler. We are
proposing one such mitigation—implicit conversion—as part of the “different
type, shared storage” option, to help avoid the cognitive load on developers
of having to deal with a separate `Substring` type.

To avoid the memory leak issues of a “same type, shared storage” substring
option, we considered whether the compiler could perform an implicit copy of
the underlying storage when it detects that the string is being “stored” for
long-term usage, say when it is assigned to a stored property. The trouble
with this approach is that it is very difficult for the compiler to
distinguish long-term from short-term storage in the case of abstractions
that rely on stored properties. For example, should the storing of a
substring inside an `Optional` be considered long-term? Or the storing of
multiple substrings inside an array? The latter would not work well in the
case of a `components(separatedBy:)` implementation that intended to return
an array of substrings. It would also be difficult to distinguish intentional
medium-term storage of substrings, say by a lexer. There does not appear to
be an effective, consistent rule that could be applied in the general case
for detecting when a substring is truly being stored long-term.

To avoid the cost of copying substrings under “same type, copied storage”,
the optimizer could be enhanced to reduce the impact of some of those copies.
For example, this code could be optimized to pull the invariant substring out
of the loop:

```swift
for _ in 0..<lots {
  someFunc(takingString: bigString[bigRange])
}
```

It's worth noting that a similar optimization is needed to avoid an
equivalent problem with implicit conversion in the “different type, shared
storage” case:

```swift
let substring = bigString[bigRange]
for _ in 0..<lots { someFunc(takingString: substring) }
```

However, in the case of “same type, copied storage” there are many use cases
that cannot be optimized as easily. Consider the following simple definition
of a recursive `contains` algorithm which, when substring slicing is linear,
makes the overall algorithm quadratic:

```swift
extension String {
    func containsChar(_ x: Character) -> Bool {
        return !isEmpty && (first == x || dropFirst().containsChar(x))
    }
}
```

It is unrealistic to expect the optimizer to eliminate this problem; instead,
the user must remember to avoid string slicing if they want the code to be
efficient:

```swift
extension String {
    // add optional argument tracking progress through the string
    func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
        let idx = idx ?? startIndex
        return idx != endIndex
            && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
    }
}
```

#### Substrings, Ranges and Objective-C Interop

The pattern of passing a string/range pair is common in several Objective-C
APIs, and is made especially awkward in Swift by the non-interchangeability
of `Range<String.Index>` and `NSRange`.

```swift
s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))
```

In general, however, the Swift idiom for operating on a sub-range of a
`Collection` is to *slice* the collection and operate on that:

```swift
s2.find(s2[j..<s2.endIndex])
```

Therefore, APIs that operate on an `NSString`/`NSRange` pair should be
imported without the `NSRange` argument. The Objective-C importer should be
changed to give these APIs special treatment so that when a `Substring` is
passed, instead of being converted to a `String`, the full `NSString` and
range are passed to the Objective-C method, thereby avoiding a copy.

As a result, you would never need to pass an `NSRange` to these APIs, which
solves the impedance problem by eliminating the argument, resulting in more
idiomatic Swift code while retaining the performance benefit. To help users
manually handle any cases that remain, Foundation should be augmented to
allow the following syntax for converting to and from `NSRange`:

```swift
let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
let iToJ = Range(nsr, in: s)    // Equivalent to i..<j
```

### The `Unicode` protocol

With `Substring` and `String` being distinct types that share almost all of
their interface and semantics, and with the highest-performance string
processing requiring knowledge of encoding and layout that the currency
types can't provide, it becomes important to capture the common “string API”
in a protocol. Since Unicode conformance is a key feature of string
processing in Swift, we call that protocol `Unicode`:

**Note:** The following assumes several features that are planned but not
yet implemented in Swift, and should be considered a sketch rather than a
final design.

```swift
protocol Unicode
  : Comparable, BidirectionalCollection where Element == Character {

  associatedtype Encoding : UnicodeEncoding
  var encoding: Encoding { get }

  associatedtype CodeUnits
    : RandomAccessCollection where Element == Encoding.CodeUnit
  var codeUnits: CodeUnits { get }

  associatedtype UnicodeScalars
    : BidirectionalCollection where Element == UnicodeScalar
  var unicodeScalars: UnicodeScalars { get }

  associatedtype ExtendedASCII
    : BidirectionalCollection where Element == UInt32
  var extendedASCII: ExtendedASCII { get }
}

extension Unicode {
  // ... define high-level non-mutating string operations, e.g. search ...

  func compared<Other: Unicode>(
    to rhs: Other,
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> SortOrder { ... }
}

extension Unicode : RangeReplaceableCollection where CodeUnits :
  RangeReplaceableCollection {
    // Satisfy protocol requirement
    mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
      where C.Element == Element

  // ... define high-level mutating string operations, e.g. replace ...
}
```

The goal is that `Unicode` exposes the underlying encoding and code units in
such a way that for types with a known representation (e.g. a
high-performance `UTF8String`), that information can be known at compile time
and used to generate a single code path, while still allowing types like
`String` that admit multiple representations to use runtime queries and
branches to fast-path specializations.
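
A hypothetical illustration of that single-path idea (none of these types
exist yet; `countASCIISpaces` and `UTF8String` are made up for this sketch):

```swift
// For a concrete type such as a hypothetical UTF8String, the branch below is
// resolved at compile time, so only the fast path is emitted.
func countASCIISpaces<S: Unicode>(_ s: S) -> Int {
  if S.Encoding.CodeUnit.self == UInt8.self {
    return s.codeUnits.reduce(0) { $1 == 0x20 ? $0 + 1 : $0 }   // raw code-unit fast path
  }
  return s.unicodeScalars.reduce(0) { $1.value == 0x20 ? $0 + 1 : $0 } // general path
}
```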

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.

### Scanning, Matching, and Tokenization

#### Low-Level Textual Analysis

We should provide convenient APIs for processing strings by character. For
example, it should be easy to cleanly express, “if this string starts with
`"f"`, process the rest of the string as follows…” Swift is well-suited to
expressing this common pattern beautifully, but we need to add the APIs.
Here are two examples of the sort of code that might be possible given such
APIs:

```swift
if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
  somethingWith(input) // process the rest of input
}

if let (number, restOfInput) = input.parsingPrefix(Int.self) {
   ...
}
```

The specific spelling and functionality of APIs like this are TBD. The larger
point is to make sure matching-and-consuming jobs are well-supported.
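
One way such an API could be prototyped on a substring type (the name
`droppingPrefix` follows the example above, but the design is TBD; this is a
sketch, not a proposed implementation):

```swift
extension Substring {
  /// If the first character satisfies `predicate`, consumes and returns it;
  /// otherwise returns nil and leaves `self` unchanged.
  mutating func droppingPrefix(_ predicate: (Character) -> Bool) -> Character? {
    guard let c = first, predicate(c) else { return nil }
    removeFirst()
    return c
  }
}

var input: Substring = "f123"
let letter = input.droppingPrefix { "a"..."z" ~= $0 }  // consumes "f"
// letter == "f", input == "123"
```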

#### Unified Pattern Matcher Protocol

Many of the current methods that do matching are overloaded to do the same
logical operations in different ways, with the following axes:

- Logical Operation: `find`, `split`, `replace`, match at start
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
  the method name, and sometimes an argument
- Whole string or subrange.

We should represent these aspects as orthogonal, composable components,
abstracting pattern matchers into a protocol like
[this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33).
That would allow us to define each logical operation once, without
introducing overloads, massively reducing API surface area.

For example, using the strawman prefix `%` syntax to turn string literals
into patterns, the following pairs would all invoke the same generic methods:

```swift
if let found = s.firstMatch(%"searchString") { ... }
if let found = s.firstMatch(someRegex) { ... }

for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
for m in s.allMatches(someRegex) { ... }

let items = s.split(separatedBy: ", ")
let tokens = s.split(separatedBy: CharacterSet.whitespace)
```

Note that, because Swift requires the indices of a slice to match the indices
of the range from which it was sliced, operations like `firstMatch` can
return a `Substring?` in lieu of a `Range<String.Index>?`: the indices of the
match in the string being searched, if needed, can easily be recovered as the
`startIndex` and `endIndex` of the `Substring`.

Note also that matching operations are useful for collections in general, and
would fall out of this proposal:

```swift
// replace subsequences of contiguous NaNs with zero
forces.replace(oneOrMore([Float.nan]), [0.0])
```

#### Regular Expressions

Addressing regular expressions is out of scope for this proposal. That said,
it is important to note that the pattern matching protocol mentioned above
provides a suitable foundation for regular expressions, and types such as
`NSRegularExpression` can easily be retrofitted to conform to it. In the
future, support for regular expression literals in the compiler could allow
for compile-time syntax checking and optimization.

### String Indices

`String` currently has four views—`characters`, `unicodeScalars`, `utf8`,
and `utf16`—each with its own opaque index type. The APIs used to translate
indices between views add needless complexity, and the opacity of indices
makes them difficult to serialize.

The index translation problem has two aspects:

  1. `String` views cannot consume one another's indices without a cumbersome
     conversion step. An index into a `String`'s `characters` must be
     translated before it can be used as a position in its `unicodeScalars`.
     Although these translations are rarely needed, they add conceptual and
     API complexity.
  2. Many APIs in the core libraries and other frameworks still expose
     `String` positions as `Int`s and regions as `NSRange`s, which can only
     reference a `utf16` view and interoperate poorly with `String` itself.

#### Index Interchange Among Views

String's need for flexible backing storage and reasonably-efficient indexing
(i.e. without dynamically allocating and reference-counting the indices
themselves) means indices need an efficient underlying storage type. Although
we do not wish to expose `String`'s indices *as* integers, `Int` offsets into
underlying code unit storage make a good underlying storage type, provided
`String`'s underlying storage supports random access. We think random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Making these `Int` code unit offsets conveniently accessible and
constructible solves the serialization problem:

```swift
clipboard.write(s.endIndex.codeUnitOffset)
let offset = clipboard.read(Int.self)
let i = String.Index(codeUnitOffset: offset)
```

Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
seamless by having them share an index type (the semantics of indexing a
`String` between grapheme cluster boundaries are TBD—it can either trap or
be forgiving). Having a common index allows easy traversal into the interior
of graphemes, something that is often needed, without making it likely that
someone will do it by accident.

- `String.index(after:)` should advance to the next grapheme, even when the
  index points partway through a grapheme.

- `String.index(before:)` should move to the start of the grapheme before
  the current position.

Seamless index interchange between `String` and its UTF-8 or UTF-16 views is
not crucial, as the specifics of encoding should not be a concern for most
use cases, and would impose needless costs on the indices of other views.
That said, we can make translation much more straightforward by exposing
simple bidirectional converting `init`s on both index types:

```swift
let u8Position = String.UTF8.Index(someStringIndex)
let originalPosition = String.Index(u8Position)
```

#### Index Interchange with Cocoa

We intend to address `NSRange`s that denote substrings in Cocoa APIs as
described
[later in this document](#substrings--ranges-and-objective-c-interop).
That leaves the interchange of bare indices with Cocoa APIs trafficking in
`Int`. Hopefully such APIs will be rare, but when needed, the following
extension, which would be useful for all `Collection`s, can help:

```swift
extension Collection {
  func index(offset: IndexDistance) -> Index {
    return index(startIndex, offsetBy: offset)
  }
  func offset(of i: Index) -> IndexDistance {
    return distance(from: startIndex, to: i)
  }
}
```

Then integers can easily be translated into offsets into a `String`'s `utf16`
view for consumption by Cocoa:

```swift
let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
let swiftIndex = s.utf16.index(offset: cocoaIndex)
```

### Formatting

A full treatment of formatting is out of scope of this proposal, but we
believe it's crucial for completing the text processing picture. This section
details some of the existing issues and thinking that may guide future
development.

#### Printf-Style Formatting

`String.format` is designed on the `printf` model: it takes a format string
with textual placeholders for substitution, and an arbitrary list of other
arguments. The syntax and meaning of these placeholders has a long history in
C, but for anyone who doesn't use them regularly they are cryptic and
complex, as the `printf(3)` man page attests.

Aside from complexity, this style of API has two major problems. First, the
spelling of these placeholders must match up to the types of the arguments,
in the right order, or the behavior is undefined. Some limited support for
compile-time checking of this correspondence could be implemented, but only
for the cases where the format string is a literal. Second, there's no
reasonable way to extend the formatting vocabulary to cover the needs of new
types: you are stuck with what's in the box.

#### Foundation Formatters

The formatters supplied by Foundation are highly capable and versatile,
offering both formatting and parsing services. When used for formatting,
though, the design pattern demands more from users than it should:

  * Matching the type of data being formatted to a formatter type
  * Creating an instance of that type
  * Setting stateful options (`currency`, `dateStyle`) on the type. Note: the
    need for this step prevents the instance from being used and discarded in
    the same expression where it is created.
  * Overall, introduction of needless verbosity into source

These may seem like small issues, but the experience of Apple localization
experts is that the total drag of these factors on programmers is such that
they tend to reach for `String.format` instead.
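
Concretely, formatting a single currency value with today's Foundation walks
through every step above (real API, shown for illustration):

```swift
import Foundation

let formatter = NumberFormatter()              // match data to a formatter type, create it
formatter.numberStyle = .currency              // set stateful options...
formatter.locale = Locale(identifier: "en_US") // ...on the instance
let price = formatter.string(from: 9.99)       // e.g. "$9.99" — four steps for one value
```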

#### String Interpolation

Swift string interpolation provides a user-friendly alternative to printf's
domain-specific language (just write ordinary Swift code!) and its type
safety problems (put the data right where it belongs!), but the following
issues prevent it from being useful for localized formatting (among other
jobs):

  * [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to
    restrict types used in string interpolation.
  * [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation
    can't distinguish (fragments of) the base string from the string
    substitutions.

In the long run, we should improve Swift string interpolation to the point
where it can participate in almost any formatting job. Mostly this centers
around fixing the interpolation protocols per the previous item, and
supporting localization.

To be able to use formatting effectively inside interpolations, it needs to
be both lightweight (because it all happens in-situ) and discoverable. One
approach would be to standardize on `format` methods, e.g.:

```swift
"Column 1: \(n.format(radix:16, width:8)) *** \(message)"

"Something with leading zeroes: \(x.format(fill: zero, width:8))"
```

### C String Interop

Our support for interoperation with nul-terminated C strings is scattered and
incoherent, with six ways to transform a C string into a `String` and four
ways to do the inverse. These APIs should be replaced with the following:

```swift
extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  ///   bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

  /// Constructs a `String` having the same contents as
  /// `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units
  ///   in the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)

  /// Invokes the given closure on the contents of the string, represented as
  /// a pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}
```

In both of the construction APIs, any invalid encoding sequence detected will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per the Unicode specification. This covers the common case. The
replacement is done *physically* in the underlying storage, and the validity
of the result is recorded in the `String`'s `encoding` so that future
accesses need not be slowed down by separate error repair.

Construction that is aborted when encoding errors are detected can be
accomplished using APIs on the `encoding`. String types that retain their
physical encoding even in the presence of errors and are repaired on-the-fly
can be built as different instances of the `Unicode` protocol.
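
The existing `String(cString:)` already behaves this way for UTF-8, which the
signatures above generalize to other encodings; for example:

```swift
// Round-trip through a C string:
let s = "héllo"
let roundTripped = s.withCString { String(cString: $0) }      // "héllo"

// An invalid byte is repaired with U+FFFD:
let bytes: [CChar] = [0x68, 0x69, CChar(bitPattern: 0xFF), 0] // "hi", invalid 0xFF, NUL
let repaired = String(cString: bytes)                         // "hi\u{FFFD}"
```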

### Unicode 9 Conformance

Unicode 9 (and macOS 10.11) brought us support for family emoji, which
changes the process of properly identifying `Character` boundaries. We need
to update `String` to account for this change.

### High-Performance String Processing

Many strings are short enough to store in 64 bits, many can be stored using
only 8 bits per unicode scalar, others are best encoded in UTF-16, and some
come to us already in some other encoding, such as UTF-8, that would be
costly to translate. Supporting these formats while maintaining usability for
general-purpose APIs demands that a single `String` type can be backed by
many different representations.

That said, the highest performance code always requires static knowledge of
the data structures on which it operates, and for this code, dynamic
selection of representation comes at too high a cost. Heavy-duty text
processing demands a way to opt out of dynamism and directly use known
encodings. Having this ability can also make it easy to cleanly specialize
code that handles dynamic cases for maximal efficiency on the most common
representations.

To address this need, we can build models of the `Unicode` protocol that
encode representation information into the type, such as
`NFCNormalizedUTF16String`.

### Parsing ASCII Structure

Although many machine-readable formats support the inclusion of arbitrary
Unicode text, it is also common that their fundamental structure lies
entirely within the ASCII subset (JSON, YAML, many XML formats). These
formats are often processed most efficiently by recognizing ASCII structural
elements as ASCII, and capturing the arbitrary sections between them in
more-general strings. The current `String` API offers no way to efficiently
recognize ASCII and skip past everything else without the overhead of full
decoding into unicode scalars.

For these purposes, strings should supply an `extendedASCII` view that is a
collection of `UInt32`, where values less than `0x80` represent the
corresponding ASCII character, and other values represent data that is
specific to the underlying encoding of the string.
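
A hypothetical sketch of how a parser might use such a view (the
`extendedASCII` property and the `Unicode` protocol do not exist yet, and
string-literal contents are ignored for brevity):

```swift
// Count the maximum bracket-nesting depth of a JSON-like document by looking
// only at ASCII structure, never decoding the non-ASCII payload.
func maxNestingDepth<S: Unicode>(_ document: S) -> Int {
  var depth = 0, deepest = 0
  for unit in document.extendedASCII {
    switch unit {
    case 0x7B, 0x5B:          // "{" or "["
      depth += 1
      deepest = max(deepest, depth)
    case 0x7D, 0x5D:          // "}" or "]"
      depth -= 1
    default:
      break                   // other ASCII, or encoding-specific data >= 0x80
    }
  }
  return deepest
}
```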

## Language Support

This proposal depends on two new features in the Swift language:

1. **Generic subscripts**, to enable unified slicing syntax.

2. **A subtype relationship** between `Substring` and `String`, enabling
   framework APIs to traffic solely in `String` while still making it
   possible to avoid copies by handling `Substring`s where necessary.

Additionally, **the ability to nest types and protocols inside
protocols** could significantly shrink the footprint of this proposal
on the top-level Swift namespace.

## Open Questions

### Must `String` be limited to storing UTF-16 subset encodings?

- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not
  in question here; this is about what encodings must be storable, without
  transcoding, in the common currency type called “`String`”.
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
- If we have a way to get at a `String`'s code units, we need a concrete
  type in which to express them in the API of `String`, which is a concrete
  type.
- If `String` needs to be able to represent UTF-32, presumably the code
  units need to be `UInt32`.
- Not supporting UTF-32-encoded text seems like one reasonable design
  choice.
- Maybe we can allow UTF-8 storage in `String` and expose its code units as
  `UInt16`, just as we would for Latin-1.
- Supporting only UTF-16-subset encodings would imply that `String` indices
  can be serialized without recording the `String`'s underlying encoding.

### Do we need a type-erasable base protocol for UnicodeEncoding?

`UnicodeEncoding` has an associated type, but it may be important to be able
to traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”
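For illustration, the erasure pattern in question can be sketched with a stand-in protocol (`MiniEncoding` and `AnyEncoding` below are hypothetical, not the real `UnicodeEncoding`):

```swift
protocol MiniEncoding {
    associatedtype CodeUnit
    static var name: String { get }
}

enum UTF8Mini: MiniEncoding {
    typealias CodeUnit = UInt8
    static var name: String { "UTF-8" }
}
enum UTF16Mini: MiniEncoding {
    typealias CodeUnit = UInt16
    static var name: String { "UTF-16" }
}

// The erased base carries only the dynamic facts callers need,
// so it has no associated type and can be passed around freely.
struct AnyEncoding {
    let name: String
    let codeUnitBitWidth: Int
    init<E: MiniEncoding>(_: E.Type) where E.CodeUnit: FixedWidthInteger {
        name = E.name
        codeUnitBitWidth = E.CodeUnit.bitWidth
    }
}

// “Tell me the most efficient encoding for this string” can now
// return a value even though the protocol has an associated type.
let best = AnyEncoding(UTF8Mini.self)
print(best.name, best.codeUnitBitWidth)  // UTF-8 8
```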

### Should there be a string “facade?”

One possible design alternative makes `Unicode` a vehicle for expressing
the storage and encoding of code units, but does not attempt to give it an
API appropriate for `String`. Instead, string APIs would be provided by a
generic wrapper around an instance of `Unicode`:

```swift
struct StringFacade<U: Unicode> : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias String = StringFacade<StringStorage>
typealias Substring = StringFacade<StringStorage.SubSequence>
```

This design would allow us to de-emphasize lower-level `String` APIs such as
access to the specific encoding, by putting them behind a `.unicode`
property. A similar effect in a facade-less design would require a new
top-level `StringProtocol` playing the role of the facade with an
`associatedtype Storage : Unicode`.
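A toy model of the facade idea, using stand-in types rather than the proposed `Unicode`/`StringStorage`, shows how an API written once on the wrapper surfaces on every specialization:

```swift
// Stand-ins for the proposed protocol and storage types.
protocol MiniUnicode { var scalarValues: [UInt32] { get } }

struct MiniStorage: MiniUnicode { var scalarValues: [UInt32] }

struct Facade<U: MiniUnicode> {
    var unicode: U
    // Written once on the facade, available on every specialization.
    var scalarCount: Int { unicode.scalarValues.count }
}

typealias MyString = Facade<MiniStorage>

let s = MyString(unicode: MiniStorage(scalarValues: [72, 105]))
print(s.scalarCount)  // 2
```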

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

```swift
struct String<U: Unicode = StringStorage>
  : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>
```

One advantage of such a design is that naïve users will always extend “the
right type” (`String`) without thinking, and the new APIs will show up on
`Substring`, `MyUTF8String`, etc. That said, it also has downsides that
should not be overlooked, not least of which is the confusability of the
meaning of the word “string.” Is it referring to the generic or the concrete
type?

### `TextOutputStream` and `TextOutputStreamable`

`TextOutputStreamable` is intended to provide a vehicle for efficiently
transporting formatted representations to an output stream without forcing
the allocation of storage. Its use of `String`, a type with multiple
representations, as the lowest-level unit of communication, conflicts with
this goal. It might be sufficient to change `TextOutputStream` and
`TextOutputStreamable` to traffic in an associated type conforming to
`Unicode`, but that is not yet clear. This area will require some design
work.

### `description` and `debugDescription`

* Should these be creating localized or non-localized representations?
* Is returning a `String` efficient enough?
* Is `debugDescription` pulling the weight of the API surface area it adds?

### `StaticString`

`StaticString` was added as a byproduct of standard library development and
kept around because it seemed useful, but it was never truly *designed* for
client programmers. We need to decide what happens with it. Presumably
*something* should fill its role, and that should conform to `Unicode`.

## Footnotes

<b id="f0">0</b> The integers rewrite currently underway is expected to
    substantially reduce the scope of `Int`'s API by using more
    generics. [:leftwards_arrow_with_hook:](#a0)

<b id="f1">1</b> In practice, these semantics will usually be tied to the
version of the installed [ICU](http://icu-project.org) library, which
programmatically encodes the most complex rules of the Unicode Standard and
its de-facto extension, CLDR.[:leftwards_arrow_with_hook:](#a1)

<b id="f2">2</b> See
[http://unicode.org/reports/tr29/#Notation](http://unicode.org/reports/tr29/#Notation).
Note that inserting Unicode scalar values to prevent merging of grapheme
clusters would also constitute a kind of misbehavior (one of the clusters at
the boundary would not be found in the result), so would be relatively
costly to implement, with little
benefit. [:leftwards_arrow_with_hook:](#a2)

<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by
  the Unicode standard for this purpose. In fact there's a
  [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
  dedicated to it. In particular, §5.17 says:

  > When comparing text that is visible to end users, a correct linguistic
  > sort should be used, as described in _Section 5.16, Sorting and
  > Searching_. However, in many circumstances the only requirement is for a
  > fast, well-defined ordering. In such cases, a binary ordering can be
  > used.

  [:leftwards_arrow_with_hook:](#a4)

<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto
properties in a table that's indexed by unicode scalar value. This table is
part of the Unicode standard. Some of these queries (e.g., “is this an
uppercase character?”) may have fairly obvious generalizations to grapheme
clusters, but exactly how to do it is a research topic and *ideally* we'd
either establish the existing practice that the Unicode committee would
standardize, or the Unicode committee would do the research and we'd
implement their result.[:leftwards_arrow_with_hook:](#a5)

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

--
Joshua Alvarado
alvaradojoshua0@gmail.com


(Russ Bishop) #18

### Formatting

A full treatment of formatting is out of scope of this proposal, but
we believe it's crucial for completing the text processing picture. This
section details some of the existing issues and thinking that may guide future
development.

Filesystem paths are Strings on Apple platforms but not on Linux. How are we going to square that circle? What about Swift on the server, where distinguishing HTML and JavaScript is security-critical? There are huge security implications to string processing, often around platforms making it easy to do the wrong thing in a careless way and promoting ad-hoc formatting, serialization, and parsing. That’s a huge area to consider, of course, but it might be worth thinking about how an ergonomic API for a few example cases would work.

I guess my point is that formatting and interpolation are far more than “just formatting”; whether the right thing is easy or difficult will directly determine whether exploitable security vulnerabilities appear. (To be clear, I’m not saying the follow-on proposals from this need to solve those problems; maybe just give them some consideration.)

## Open Questions

### Must `String` be limited to storing UTF-16 subset encodings?

- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not
  in question here; this is about what encodings must be storable, without
  transcoding, in the common currency type called “`String`”.
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.

Depending on who you believe, UTF-8 is the encoding of ~65-88% of all text content transmitted over the web. JSON and XML represent the lion’s share of REST and non-REST APIs in use, and both are almost exclusively transmitted as UTF-8. As you point out with extendedASCII, a lot of markup and structure is ASCII even if the content is not, so UTF-8 represents a significant size savings even on Chinese/Japanese web pages that require 3 bytes to represent many characters (the savings on markup overwhelming the loss on textual content).

Any model that makes UTF-8-backed Strings difficult or cumbersome to use can have a negative performance and memory impact. I don’t have a good idea of the actual cost, but it might be worth doing some tests to determine it.

Is NSString interop the only reason to not just use UTF-8 as the default storage? If so, is that a solvable problem? Could one choose by typealias or a compiler flag which default storage they wanted?

- If we have a way to get at a `String`'s code units, we need a concrete type
  in which to express them in the API of `String`, which is a concrete type.
- If `String` needs to be able to represent UTF-32, presumably the code units
  need to be `UInt32`.
- Not supporting UTF-32-encoded text seems like one reasonable design choice.
- Maybe we can allow UTF-8 storage in `String` and expose its code units as
  `UInt16`, just as we would for Latin-1.
- Supporting only UTF-16-subset encodings would imply that `String` indices
  can be serialized without recording the `String`'s underlying encoding.

I suppose you could be clever on 64-bit platforms by stealing some bits to indicate the encoding… not that I recommend that :smiley:
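For the curious, the trick being alluded to can be sketched without real pointer games: a plain integer stands in for a pointer, whose alignment guarantees free low bits that can hold a small encoding tag (`TaggedRef` below is purely illustrative):

```swift
enum EncodingTag: UInt { case utf16 = 0, utf8 = 1, latin1 = 2 }

struct TaggedRef {
    private var bits: UInt
    init(address: UInt, tag: EncodingTag) {
        // 4-byte alignment leaves the low 2 bits free for the tag.
        precondition(address % 4 == 0, "needs 4-byte alignment")
        bits = address | tag.rawValue
    }
    var address: UInt { bits & ~0b11 }          // mask the tag back out
    var tag: EncodingTag { EncodingTag(rawValue: bits & 0b11)! }
}

let r = TaggedRef(address: 0x1000, tag: .utf8)
print(r.tag, String(r.address, radix: 16))  // utf8 1000
```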

### Do we need a type-erasable base protocol for UnicodeEncoding?

`UnicodeEncoding` has an associated type, but it may be important to be able
to traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”

Generalized Existentials
tis but happiness by another name
For we who live
in The Land of Protocols and Faeries

### Should there be a string “facade?”

One possible design alternative makes `Unicode` a vehicle for expressing
the storage and encoding of code units, but does not attempt to give it an
API appropriate for `String`. Instead, string APIs would be provided by a
generic wrapper around an instance of `Unicode`:

```swift
struct StringFacade<U: Unicode> : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias String = StringFacade<StringStorage>
typealias Substring = StringFacade<StringStorage.SubSequence>
```

This design would allow us to de-emphasize lower-level `String` APIs such as
access to the specific encoding, by putting them behind a `.unicode`
property. A similar effect in a facade-less design would require a new
top-level `StringProtocol` playing the role of the facade with an
`associatedtype Storage : Unicode`.

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

```swift
struct String<U: Unicode = StringStorage>
  : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>
```

One advantage of such a design is that naïve users will always extend “the
right type” (`String`) without thinking, and the new APIs will show up on
`Substring`, `MyUTF8String`, etc. That said, it also has downsides that
should not be overlooked, not least of which is the confusability of the
meaning of the word “string.” Is it referring to the generic or the concrete
type?

Fair point, but I do like the idea of separating the two and encouraging people to extend String while automatically extending all the String-ish types. This would compose well with a hypothetical HTMLString, JavaScriptString, etc (assuming one could design a model where those things compose well, e.g. appending MyUTF8String to HTMLString performs automatic HTML-escaping whereas appending HTMLString to HTMLString does not).

Anything that avoids forcing the average app or library author to stop and think about which String type to use is probably a net win if the performance isn’t horrible; someone writing a web server pipeline will need to write their own String-ish type for performance reasons anyway so a slight perf hit may be no great loss.

Thanks to you and Ben for the hard work so far; I can’t even imagine taking on such a task!

Russ


On Jan 19, 2017, at 6:56 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:


(Olivier Tardieu) #19

Thanks for the great write-up!

The manifesto recognizes the importance of machine processing and
performance.
I am surprised that there is no mention of any kind of "unsafe" strings or
string processing.
In general, Swift does an amazing job at incorporating unsafe mechanism
into a safe-by-default programming paradigm.
But for some reason, Strings seem to be left out of the unsafe discussion.

A lot of machine processing of strings continues to deal with 8-bit
quantities (even 7-bit quantities, not UTF-8).
Swift strings are not very good at that. I see progress in the manifesto
but nothing to really close the performance gap with C.
That's where "unsafe" mechanisms could come into play.

To guarantee Unicode correctness, a C string must be validated or
transformed to be considered a Swift string.
If I understand the C String interop section correctly, in Swift 4, this
should not force a copy, but traversing the string is still required.
I hope I am correct about the no-copy thing, and I would also like to
permit promoting C strings to Swift strings without validation.
This is obviously unsafe in general, but I know my strings... and I care
about performance. :wink:

More importantly, it is not possible to mutate bytes in a Swift string at
will.
Again it makes sense from the point of view of always correct Unicode
sequences.
But it does not for machine processing of C strings with C-like
performance.
Today, I can cheat using a "_public" API for this, i.e.,
myString._core._baseAddress!.
This should be doable from an official "unsafe" API.

Memory safety is also at play here, as well as ownership.
A proper API could guarantee, for instance, that the backing store is
writable and not shared.
A memory-safe but not unicode-safe API could do bounds checks.

While low-level C string processing can be done using unsafe memory
buffers with performance, the lack of bridging with "real" Swift strings
kills the deal.
No literals syntax (or costly coercions), none of the many useful string
APIs.

To illustrate these points here is a simple experiment: code written to
synthesize an http date string from a bunch of integers.
There are four versions of the code going from nice high-level Swift code
to low-level C-like code.
(Some of this code is also about avoiding ARC overheads, and string
interpolation overheads, hence the four versions.)

On my macbook pro (swiftc -O), the performance is as follows:

interpolation + func: 2.303032365s
interpolation + array: 1.224858418s
append: 0.918512377s
memcpy: 0.182104674s

While the benchmarking could be done more carefully, I think the main
observation is valid.
The nice code is more than 10x slower than the C-like code.
Moreover, the ugly-but-still-valid-Swift code is still about 5x slower
than the C like code.
For some applications, e.g. web servers, these kinds of numbers matter...

Some of the proposed improvements would help with this, e.g., small
strings optimization, and maybe changes to the concatenation semantics.
But it seems to me that a big performance gap will remain.
(Concatenation even with strncat is significantly slower than memcpy for
fixed-size strings.)

I believe there is a need and an opportunity for a fast "less safe" String
API.
I hope it will be on the roadmap soon.

Best,

Olivier

import Foundation

// get current date as a series of integers
// (could be done differently... faster... not the topic)

var theTime = time(nil)
var timeStruct = tm()
gmtime_r(&theTime, &timeStruct)
let wday = Int(timeStruct.tm_wday)
let mday = Int(timeStruct.tm_mday)
let mon = Int(timeStruct.tm_mon)
let year = Int(timeStruct.tm_year) + 1900
let hour = Int(timeStruct.tm_hour)
let min = Int(timeStruct.tm_min)
let sec = Int(timeStruct.tm_sec)

let months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

let days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

func twoDigit(_ num: Int) -> String {
    return (num < 10 ? "0" : "") + String(num)
}

let twoDigit = ["00", "01", "02", "03", "04", "05", "06", "07", "08", "09",
                "10", "11", "12", "13", "14", "15", "16", "17", "18", "19",
                "20", "21", "22", "23", "24", "25", "26", "27", "28", "29",
                "30", "31", "32", "33", "34", "35", "36", "37", "38", "39",
                "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
                "50", "51", "52", "53", "54", "55", "56", "57", "58", "59",
                "60", "61", "62", "63", "64", "65", "66", "67", "68", "69",
                "70", "71", "72", "73", "74", "75", "76", "77", "78", "79",
                "80", "81", "82", "83", "84", "85", "86", "87", "88", "89",
                "90", "91", "92", "93", "94", "95", "96", "97", "98", "99"]

// interpolation + func

func httpDate() -> String {
    return "\(days[wday]), \(twoDigit(mday)) \(months[mon]) \(year) \(twoDigit(hour)):\(twoDigit(min)):\(twoDigit(sec)) GMT"
}

// interpolation + array

func httpDate1() -> String {
    return "\(days[wday]), \(twoDigit[mday]) \(months[mon]) \(year) \(twoDigit[hour]):\(twoDigit[min]):\(twoDigit[sec]) GMT"
}

// append + array

func httpDate2() -> String {
    var s = days[wday]
    s.append(", ")
    s.append(twoDigit[mday])
    s.append(" ")
    s.append(months[mon])
    s.append(" ")
    s.append(twoDigit[year/100])
    s.append(twoDigit[year%100])
    s.append(" ")
    s.append(twoDigit[hour])
    s.append(":")
    s.append(twoDigit[min])
    s.append(":")
    s.append(twoDigit[sec])
    s.append(" GMT")
    return s
}

// memcpy + array

func httpDate3() -> String {
    var s = "XXX, XX XXX XXXX XX:XX:XX GMT"
    s.append("") // force alloc
    let ptr = s._core._baseAddress!
    memcpy(ptr, days[wday]._core._baseAddress!, 3)
    memcpy(ptr.advanced(by: 8), months[mon]._core._baseAddress!, 3)
    memcpy(ptr.advanced(by: 5), twoDigit[mday]._core._baseAddress!, 2)
    memcpy(ptr.advanced(by: 12), twoDigit[year/100]._core._baseAddress!, 2)
    memcpy(ptr.advanced(by: 14), twoDigit[year%100]._core._baseAddress!, 2)
    memcpy(ptr.advanced(by: 17), twoDigit[hour]._core._baseAddress!, 2)
    memcpy(ptr.advanced(by: 20), twoDigit[min]._core._baseAddress!, 2)
    memcpy(ptr.advanced(by: 23), twoDigit[sec]._core._baseAddress!, 2)
    return s
}

var s = ""

var now = mach_absolute_time()
for _ in 0..<1000000 {
    s = httpDate()
}
print(s)
print("interpolation + func: \(Double(mach_absolute_time() - now) / 1e9)s\n")

now = mach_absolute_time()
for _ in 0..<1000000 {
    s = httpDate1()
}
print(s)
print("interpolation + array: \(Double(mach_absolute_time() - now) / 1e9)s\n")

now = mach_absolute_time()
for _ in 0..<1000000 {
    s = httpDate2()
}
print(s)
print("append: \(Double(mach_absolute_time() - now) / 1e9)s\n")

now = mach_absolute_time()
for _ in 0..<1000000 {
    s = httpDate3()
}
print(s)
print("memcpy: \(Double(mach_absolute_time() - now) / 1e9)s\n")

From: Ben Cohen via swift-evolution <swift-evolution@swift.org>
To: swift-evolution <swift-evolution@swift.org>
Cc: Dave Abrahams <dabrahams@apple.com>
Date: 01/19/2017 09:56 PM
Subject: [swift-evolution] Strings in Swift 4
Sent by: swift-evolution-bounces@swift.org

Hi all,

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

We’re eager to hear everyone’s thoughts.

Regards,
Ben and Dave

# String Processing For Swift 4

* Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](
https://github.com/airspeedswift)

The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined
thus far, with just this short blurb in the
[list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):

> **String re-evaluation**: String is one of the most important fundamental
> types in the language. The standard library leads have numerous ideas of
> how to improve the programming model for it, without jeopardizing the goals
> of providing a unicode-correct-by-default model. Our goal is to be better
> at string processing than Perl!

For Swift 4 and beyond we want to improve three dimensions of text
processing:

  1. Ergonomics
  2. Correctness
  3. Performance

This document is meant to both provide a sense of the long-term vision
(including undecided issues and possible approaches), and to define the scope
of work that could be done in the Swift 4 timeframe.

## General Principles

### Ergonomics

It's worth noting that ergonomics and correctness are mutually-reinforcing. An
API that is easy to use—but incorrectly—cannot be considered an ergonomic
success. Conversely, an API that's simply hard to use is also hard to use
correctly. Achieving optimal performance without compromising ergonomics or
correctness is a greater challenge.

Consistency with the Swift language and idioms is also important for
ergonomics. There are several places both in the standard library and in the
foundation additions to `String` where patterns and practices found elsewhere
could be applied to improve usability and familiarity.

### API Surface Area

Primary data types such as `String` should have APIs that are easily
understood given a signature and a one-line summary. Today, `String` fails
that test. As you can see, the Standard Library and Foundation both
contribute significantly to its overall complexity.

**Method Arity** | **Standard Library** | **Foundation**
---|:---:|:---:
0: `ƒ()` | 5 | 7
1: `ƒ(:)` | 19 | 48
2: `ƒ(::)` | 13 | 19
3: `ƒ(:::)` | 5 | 11
4: `ƒ(::::)` | 1 | 7
5: `ƒ(:::::)` | - | 2
6: `ƒ(::::::)` | - | 1

**API Kind** | **Standard Library** | **Foundation**
---|:---:|:---:
`init` | 41 | 18
`func` | 42 | 55
`subscript` | 9 | 0
`var` | 26 | 14

**Total: 205 APIs**

By contrast, `Int` has 80 APIs, none with more than two parameters.[0]
String processing is complex enough; users shouldn't have
to press through physical API sprawl just to get started.

Many of the choices detailed below contribute to solving this problem,
including:

  * Restoring `Collection` conformance and dropping the `.characters` view.
  * Providing a more general, composable slicing syntax.
  * Altering `Comparable` so that parameterized (e.g. case-insensitive)
    comparison fits smoothly into the basic syntax.
  * Clearly separating language-dependent operations on text produced by and
    for humans from language-independent operations on text produced by and
    for machine processing.
  * Relocating APIs that fall outside the domain of basic string processing
    and discouraging the proliferation of ad-hoc extensions.

### Batteries Included

While `String` is available to all programs out-of-the-box, crucial APIs for
basic string processing tasks are still inaccessible until `Foundation` is
imported. While it makes sense that `Foundation` is needed for
domain-specific jobs such as
[linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
one should not need to import anything to, for example, do case-insensitive
comparison.
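For example, today the natural spelling of a case-insensitive comparison comes from Foundation rather than the standard library; the following does not compile without the import:

```swift
import Foundation  // required: caseInsensitiveCompare is a Foundation API

let result = "Swift".caseInsensitiveCompare("SWIFT")
print(result == .orderedSame)  // true
```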

### Unicode Compliance and Platform Support

The Unicode standard provides a crucial objective reference point for what
constitutes correct behavior in an extremely complex domain, so
Unicode-correctness is, and will remain, a fundamental design principle
behind Swift's `String`. That said, the Unicode standard is an evolving
document, so this objective reference-point is not fixed.[1] While many of
the most important operations—e.g. string hashing, equality, and
non-localized comparison—will be stable, the semantics of others, such as
grapheme breaking and localized comparison and case conversion, are expected
to change as platforms are updated, so programs should be written so their
correctness does not depend on precise stability of these semantics across
OS versions or platforms. Although it may be possible to imagine static
and/or dynamic analysis tools that will help users find such errors, the
only sure way to deal with this fact of life is to educate users.

## Design Points

### Internationalization

There is strong evidence that developers cannot determine how to use
internationalization APIs correctly. Although documentation could and should
be improved, the sheer size, complexity, and diversity of these APIs is a
major contributor to the problem, causing novices to tune out, and more
experienced programmers to make avoidable mistakes.

The first step in improving this situation is to regularize all localized
operations as invocations of normal string operations with extra
parameters. Among other things, this means:

1. Doing away with `localizedXXX` methods.
2. Providing a terse way to name the current locale as a parameter.
3. Automatically adjusting defaults for options such as case sensitivity
   based on whether the operation is localized.
4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
   guidance in the
   [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).

Along with appropriate documentation updates, these changes will make
localized operations more teachable, comprehensible, and approachable,
thereby lowering a barrier that currently leads some developers to ignore
localization issues altogether.

#### The Default Behavior of `String`

Although this isn't well-known, the most accessible form of many operations
on Swift `String` (and `NSString`) are really only appropriate for text that
is intended to be processed for, and consumed by, machines. The semantics of
the operations with the simplest spellings are always non-localized and
language-agnostic.

Two major factors play into this design choice:

1. Machine processing of text is important, so we should have first-class,
   accessible functions appropriate to that use case.

2. The most general localized operations require a locale parameter not
   required by their un-localized counterparts. This naturally skews
   complexity towards localized operations.

Reaffirming that `String`'s simplest APIs have
language-independent/machine-processed semantics has the benefit of
clarifying the proper default behavior of operations such as comparison, and
allows us to make [significant optimizations](#collation-semantics) that were
previously thought to conflict with Unicode.

#### Future Directions

One of the most common internationalization errors is the unintentional
presentation to users of text that has not been localized, but regularizing
APIs and improving documentation can go only so far in preventing this
error. Combined with the fact that `String` operations are non-localized by
default, the environment for processing human-readable text may still be
somewhat error-prone in Swift 4.

For an audience of mostly non-experts, it is especially important that naïve
code is very likely to be correct if it compiles, and that more sophisticated
issues can be revealed progressively. For this reason, we intend to
specifically and separately target localization and internationalization
problems in the Swift 5 timeframe.

### Operations With Options

There are three categories of common string operations that need to be tuned
in various dimensions:

**Operation**|**Applicable Options**
---|---
sort ordering | locale, case/diacritic/width-insensitivity
case conversion | locale
pattern matching | locale, case/diacritic/width-insensitivity

The defaults for case-, diacritic-, and width-insensitivity are different for
localized operations than for non-localized operations, so for example a
localized sort should be case-insensitive by default, and a non-localized
sort should be case-sensitive by default. We propose a standard “language” of
defaulted parameters to be used for these purposes, with usage roughly like
this:

```swift
x.compared(to: y, case: .sensitive, in: swissGerman)

x.lowercased(in: .currentLocale)

x.allMatches(
  somePattern, case: .insensitive, diacritic: .insensitive)
```

This usage might be supported by code like this:

```swift
enum StringSensitivity {
  case sensitive
  case insensitive
}

extension Locale {
  static var currentLocale: Locale { ... }
}

extension Unicode {
  // An example of the option language in declaration context,
  // with nil defaults indicating unspecified, so defaults can be
  // driven by the presence/absence of a specific Locale
  func frobnicated(
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> Self { ... }
}
```
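As a runnable sketch of the defaulting rule described above (the `matches` function is hypothetical, and real case folding is more subtle than `lowercased()`):

```swift
enum Sensitivity { case sensitive, insensitive }

func matches(_ a: String, _ b: String,
             case caseSensitivity: Sensitivity? = nil,
             in locale: String? = nil) -> Bool {
    // The default is driven by the presence/absence of a locale:
    // localized operations are case-insensitive unless told otherwise.
    let resolved = caseSensitivity ?? (locale == nil ? .sensitive : .insensitive)
    switch resolved {
    case .sensitive:   return a == b
    case .insensitive: return a.lowercased() == b.lowercased()
    }
}

print(matches("Swift", "swift"))                           // false: non-localized default
print(matches("Swift", "swift", in: "de_CH"))              // true: localized default
print(matches("Swift", "swift", case: .sensitive, in: "de_CH"))  // false: explicit wins
```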

### Comparing and Hashing Strings

#### Collation Semantics

What Unicode says about collation—which is used in `<`, `==`, and
hashing—turns out to be quite interesting, once you pick it apart. The full
Unicode Collation Algorithm (UCA) works like this:

1. Fully normalize both strings.
2. Convert each string to a sequence of numeric triples to form a collation
   key.
3. “Flatten” the key by concatenating the sequence of first elements to the
   sequence of second elements to the sequence of third elements.
4. Lexicographically compare the flattened keys.

While step 1 can usually be
[done quickly](http://unicode.org/reports/tr15/#Description_Norm) and
incrementally, step 2 uses a collation table that maps matching *sequences*
of unicode scalars in the normalized string to *sequences* of triples, which
get accumulated into a collation key. Predictably, this is where the real
costs lie.
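Steps 2–4 can be illustrated with a toy weight table (nothing like the real DUCET, and mapping single characters rather than scalar sequences):

```swift
typealias Triple = (primary: UInt16, secondary: UInt16, tertiary: UInt16)

// Step 3: all first elements, then all seconds, then all thirds, so
// primary differences anywhere outrank secondary/tertiary differences.
func flattenedKey(_ triples: [Triple]) -> [UInt16] {
    triples.map { $0.primary }
        + triples.map { $0.secondary }
        + triples.map { $0.tertiary }
}

// 'a' and 'A' share a primary weight and differ only at the tertiary
// (case) level, which is why they sort adjacently.
let weights: [Character: Triple] = [
    "a": (0x29, 0x20, 0x02), "A": (0x29, 0x20, 0x08),
    "b": (0x2A, 0x20, 0x02),
]

func key(_ s: String) -> [UInt16] {
    flattenedKey(s.compactMap { weights[$0] })
}

// Step 4: lexicographic comparison of the flattened keys.
print(key("ab").lexicographicallyPrecedes(key("b")))  // true: primary decides
print(key("a").lexicographicallyPrecedes(key("A")))   // true: tertiary decides
```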

*However*, there are some bright spots to this story. First, as it turns out,
string sorting (localized or not) should be done down to what's called the
[“identical” level](http://unicode.org/reports/tr10/#Multi_Level_Comparison),
which adds a step 3a: append the string's normalized form to the flattened
collation key. At first blush this just adds work, but consider what it does
for equality: two strings that normalize the same, naturally, will collate
the same. But also, *strings that normalize differently will always collate
differently*. In other words, for equality, it is sufficient to compare the
strings' normalized forms and see if they are the same. We can therefore
entirely skip the expensive part of collation for equality comparison.
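This shortcut can be demonstrated with Foundation's NFC normalization (`precomposedStringWithCanonicalMapping`); no collation key is ever built:

```swift
import Foundation

let composed   = "caf\u{E9}"        // é as one precomposed scalar
let decomposed = "cafe\u{301}"      // e + combining acute accent

// Different scalar sequences...
print(Array(composed.unicodeScalars) == Array(decomposed.unicodeScalars))  // false

// ...but identical once normalized, so they are equal (and would hash
// alike) without touching the collation table.
let nfc1 = composed.precomposedStringWithCanonicalMapping
let nfc2 = decomposed.precomposedStringWithCanonicalMapping
print(Array(nfc1.unicodeScalars) == Array(nfc2.unicodeScalars))  // true
```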

Next, naturally, anything that applies to equality also applies to hashing:
it is sufficient to hash the string's normalized form, bypassing collation
keys. This should provide significant speedups over the current
implementation. Perhaps more importantly, since comparison down to the
“identical” level applies even to localized strings, it means that hashing
and equality can be implemented exactly the same way for localized and
non-localized text, and hash tables with localized keys will remain valid
across current-locale changes.

Finally, once it is agreed that the *default* role for `String` is to handle
machine-generated and machine-readable text, the default ordering of
`String`s need no longer use the UCA at all. It is sufficient to order them
in any way that's consistent with equality, so `String` ordering can simply
be a lexicographical comparison of normalized forms,[4] (which is equivalent
to lexicographically comparing the sequences of grapheme clusters), again
bypassing step 2 and offering another speedup.

This leaves us executing the full UCA *only* for localized sorting, and ICU's
implementation has apparently been very well optimized.

Following this scheme everywhere would also allow us to make sorting behavior
consistent across platforms. Currently, we sort `String` according to the
UCA, except that—*only on Apple platforms*—pairs of ASCII characters are
ordered by unicode scalar value.

#### Syntax

Because the current `Comparable` protocol expresses all comparisons with
binary
operators, string comparisons—which may require
additional [options](#operations-with-options)—do not fit smoothly into
the
existing syntax. At the same time, we'd like to solve other problems with
comparison, as outlined
in
[this proposal](
https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
(implemented by changes at the head
of
[this branch](
https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
We should adopt a modification of that proposal that uses a method rather
than
an operator `<=>`:

```swift
enum SortOrder { case before, same, after }

protocol Comparable : Equatable {
  func compared(to: Self) -> SortOrder
  ...
}
```

This change will give us a syntactic platform on which to implement
methods with
additional, defaulted arguments, thereby unifying and regularizing
comparison
across the library.

```swift
extension String {
  func compared(to: Self) -> SortOrder
}
```

**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also
possible
that the standard library simply adopts Foundation's `ComparisonResult` as
is,
but we believe the community should at least consider alternate naming
before
that happens. There will be an opportunity to discuss the choices in
detail
when the modified
[Comparison Proposal](
https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes
up for review.
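
As a sketch of how the method form would read in practice (assuming the
proposed `compared(to:)` method and `SortOrder` enum, neither of which exists
yet):

```swift
// Hypothetical usage of the proposed API:
switch a.compared(to: b) {
case .before: print("a sorts first")
case .same:   print("a and b are equal")
case .after:  print("b sorts first")
}

// Defaulted arguments then extend the same call site naturally:
if a.compared(to: b, case: .insensitive) == .same { ... }
```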

### `String` should be a `Collection` of `Character`s Again

In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
convinced ourselves that its semantics differed from those of `Collection`
too
significantly.

It was always well understood that if strings were treated as sequences of
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`,
`elementsEqual`,
and `reversed` would produce nonsense results. Thus, in Swift 1.0,
`String` was
a collection of `Character` (extended grapheme clusters). During 2.0
development, though, we realized that correct string concatenation could
occasionally merge distinct grapheme clusters at the start and end of
combined
strings.

This quirk aside, every aspect of strings-as-collections-of-graphemes
appears to
comport perfectly with Unicode. We think the concatenation problem is
tolerable,
because the cases where it occurs all represent partially-formed
constructs. The
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING
ACUTE
ACCENT)—are explicitly called out in the Unicode standard as
“[degenerate](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
)” or
“[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The
other
cases—such as a string ending in a zero-width joiner or half of a regional
indicator—appear to be equally transient and unlikely outside of a text
editor.

Admitting these cases encourages exploration of grapheme composition and
is
consistent with what appears to be an overall Unicode philosophy that “no
special provisions are made to get marginally better behavior for… cases
that
never occur in practice.”[2] Furthermore, it seems
unlikely to disturb the semantics of any plausible algorithms. We can
handle
these cases by documenting them, explicitly stating that the elements of a
`String` are an emergent property based on Unicode rules.

The benefits of restoring `Collection` conformance are substantial:

  * Collection-like operations encourage experimentation with strings to
    investigate and understand their behavior. This is useful for teaching
new
    programmers, but also good for experienced programmers who want to
    understand more about strings/unicode.

  * Extended grapheme clusters form a natural element boundary for Unicode
    strings. For example, searching and matching operations will always
produce
    results that line up on grapheme cluster boundaries.

  * Character-by-character processing is a legitimate thing to do in many
real
    use-cases, including parsing, pattern matching, and language-specific
    transformations such as transliteration.

  * `Collection` conformance makes a wide variety of powerful operations
    available that are appropriate to `String`'s default role as the
vehicle for
    machine processed text.

    The methods `String` would inherit from `Collection` have the right
    semantics wherever they correspond to higher-level string algorithms. For
    example,
    grapheme-wise `lexicographicalCompare`, `elementsEqual`, and
application of
    `flatMap` with case-conversion, produce the same results one would
expect
    from whole-string ordering comparison, equality comparison, and
    case-conversion, respectively. `reverse` operates correctly on
graphemes,
    keeping diacritics moored to their base characters and leaving emoji
intact.
    Other methods such as `indexOf` and `contains` make obvious sense. A
few
    `Collection` methods, like `min` and `max`, may not be particularly
useful
    on `String`, but we don't consider that to be a problem worth solving,
in
    the same way that we wouldn't try to suppress `min` and `max` on a
    `Set([UInt8])` that was used to store IP addresses.

  * Many of the higher-level operations that we want to provide for
`String`s,
    such as parsing and pattern matching, should apply to any
`Collection`, and
    many of the benefits we want for `Collections`, such
    as unified slicing, should accrue
    equally to `String`. Making `String` part of the same protocol
hierarchy
    allows us to write these operations once and not worry about keeping
the
    benefits in sync.

  * Slicing strings into substrings is a crucial part of the vocabulary of
    string processing, and all other sliceable things are `Collection`s.
    Because of its collection-like behavior, users naturally think of
`String`
    in collection terms, but run into frustrating limitations where it
fails to
    conform and are left to wonder where all the differences lie. Many
simply
    “correct” this limitation by declaring a trivial conformance:

    ```swift
    extension String : BidirectionalCollection {}
    ```

    Even if we removed indexing-by-element from `String`, users could
still do
    this:

    ```swift
      extension String : BidirectionalCollection {
        subscript(i: Index) -> Character { return characters[i] }
      }
    ```

    It would be much better to legitimize the conformance to `Collection`
and
    simply document the oddity of any concatenation corner-cases, than to
deny
    users the benefits on the grounds that a few cases are confusing.
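
These grapheme-level semantics are observable today through the `characters`
view; restoring the conformance would surface them on `String` itself. A small
illustration:

```swift
let word = "cafe\u{301}" // "café", with the accent as a combining scalar

// "e" + U+0301 form a single Character (extended grapheme cluster):
print(word.characters.count) // 4

// reversed() keeps the diacritic moored to its base character:
print(String(word.characters.reversed())) // "éfac"
```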

Note that the fact that `String` is a collection of graphemes does *not*
mean
that string operations will necessarily have to do grapheme boundary
recognition. See the Unicode protocol section for details.

### `Character` and `CharacterSet`

`Character`, which represents a
Unicode
[extended grapheme cluster](
http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries),
is a bit of a black box, requiring conversion to `String` in order to
do any introspection, including interoperation with ASCII. To fix this,
we should:

- Add a `unicodeScalars` view much like `String`'s, so that the
sub-structure
   of grapheme clusters is discoverable.
- Add a failable `init` from sequences of scalars (returning nil for
sequences
   that contain 0 or 2+ graphemes).
- (Lower priority) expose some operations, such as `func uppercase() ->
   String`, `var isASCII: Bool`, and, to the extent they can be sensibly
   generalized, queries of unicode properties that should also be exposed
on
   `UnicodeScalar`, such as `isAlphabetic` and `isGraphemeBase`.

Despite its name, `CharacterSet` currently operates on the Swift
`UnicodeScalar`
type. This means it is usable on `String`, but only by going through the
unicode
scalar view. To deal with this clash in the short term, `CharacterSet`
should be
renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate
to
introduce a `CharacterSet` that provides similar functionality for
extended
grapheme clusters.[5]
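
A rough sketch of what the `Character` additions in the list above could look
like (hypothetical signatures, none of which exist yet):

```swift
// Sketch only:
extension Character {
  /// The Unicode scalars making up this grapheme cluster.
  var unicodeScalars: String.UnicodeScalarView { ... }

  /// Fails if `scalars` encodes zero graphemes, or more than one.
  init?<S : Sequence>(_ scalars: S)
    where S.Iterator.Element == UnicodeScalar { ... }

  var isASCII: Bool { ... }
  func uppercase() -> String { ... }
}
```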

### Unification of Slicing Operations

Creating substrings is a basic part of String processing, but the slicing
operations that we have in Swift are inconsistent in both their spelling
and
their naming:

  * Slices with two explicit endpoints are done with subscript, and
support
    in-place mutation:

    ```swift
        s[i..<j].mutate()
    ```

  * Slicing from an index to the end, or from the start to an index, is
done
    with a method and does not support in-place mutation:
    ```swift
        s.prefix(upTo: i).readOnly()
    ```

Prefix and suffix operations should be migrated to be subscripting
operations
with one-sided ranges i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as
in
[this proposal](
https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md
).
With generic subscripting in the language, that will allow us to collapse
a wide
variety of methods and subscript overloads into a single implementation,
and
give users an easy-to-use and composable way to describe subranges.

Further extending this EDSL to integrate use-cases like
`s.prefix(maxLength: 5)`
is an ongoing research project that can be considered part of the
potential
long-term vision of text (and collection) processing.
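
Concretely, the migration would read something like this (a sketch; one-sided
range subscripts do not exist yet, and the suffix spelling is a strawman):

```swift
let i = s.index(s.startIndex, offsetBy: 5)

s.prefix(upTo: i)   // today
s[..<i]             // proposed

s.suffix(from: i)   // today
s[i..<]             // proposed strawman spelling
```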

### Substrings

When implementing substring slicing, languages are faced with three
options:

1. Make the substrings the same type as string, and share storage.
2. Make the substrings the same type as string, and copy storage when
making the substring.
3. Make substrings a different type, with a storage copy on conversion to
string.

We think number 3 is the best choice. A walk-through of the tradeoffs
follows.

#### Same type, shared storage

In Swift 3.0, slicing a `String` produces a new `String` that is a view
into a
subrange of the original `String`'s storage. This is why `String` is 3
words in
size (the start, length and buffer owner), unlike the similar `Array` type
which is only one.

This is a simple model with big efficiency gains when chopping up strings
into
multiple smaller strings. But it does mean that a stored substring keeps
the
entire original string buffer alive even after it would normally have been
released.

This arrangement has proven to be problematic in other programming
languages,
because applications sometimes extract small strings from large ones and
keep
those small strings long-term. That is considered a memory leak and was
enough
of a problem in Java that they changed from substrings sharing storage to
making a copy in 1.7.

#### Same type, copied storage

Copying of substrings is also the choice made in C#, and in the default
`NSString` implementation. This approach avoids the memory leak issue, but
has
obvious performance overhead in performing the copies.

This in turn encourages trafficking in string/range pairs instead of in
substrings, for performance reasons, leading to API challenges. For
example:

```swift
foo.compare(bar, range: start..<end)
```

Here, it is not clear whether `range` applies to `foo` or `bar`. This
relationship is better expressed in Swift as a slicing operation:

```swift
foo[start..<end].compare(bar)
```

Not only does this clarify to which string the range applies, it also
brings
this sub-range capability to any API that operates on `String` "for free".
So
these other combinations also work equally well:

```swift
// apply range on argument rather than target
foo.compare(bar[start..<end])
// apply range on both
foo[start..<end].compare(bar[start1..<end1])
// compare two strings ignoring first character
foo.dropFirst().compare(bar.dropFirst())
```

In all three cases, an explicit range argument need not appear on the
`compare`
method itself. The implementation of `compare` does not need to know
anything
about ranges. Methods need only take range arguments when that was an
integral part of their purpose (for example, setting the start and end of
a
user's current selection in a text box).

#### Different type, shared storage

The desire to share underlying storage while preventing accidental memory
leaks
occurs with slices of `Array`. For this reason we have an `ArraySlice`
type.
The inconvenience of a separate type is mitigated by most operations used
on
`Array` from the standard library being generic over `Sequence` or
`Collection`.

We should apply the same approach for `String` by introducing a distinct
`SubSequence` type, `Substring`. Similar advice given for `ArraySlice`
would apply to `Substring`:

> **Important:** Long-term storage of `Substring` instances is discouraged. A
> substring holds a reference to the entire storage of a larger string, not
> just to the portion it presents, even after the original string's lifetime
> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
> of large strings that are no longer otherwise accessible, which can appear
> to be memory leakage.

When assigning a `Substring` to a longer-lived variable (usually a stored
property) explicitly of type `String`, a type conversion will be
performed, and
at this point the substring buffer is copied and the original string's
storage
can be released.
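
In code, under the proposed model (a sketch; `Substring`, the implicit
conversion, and the `readLongLine` stand-in are all hypothetical):

```swift
struct Name {
  var value: String   // long-lived storage is declared as String
}

let line = readLongLine()        // a large String
let slice = line.dropFirst()     // a Substring sharing line's storage
let stored = Name(value: slice)  // implicit Substring→String conversion:
                                 // the slice is copied here, so line's
                                 // buffer can be released afterwards
```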

A `String` that was not its own `Substring` could be one word—a single
tagged
pointer—without requiring additional allocations. `Substring`s would be a
view
onto a `String`, so are 3 words: a pointer to the owner, a pointer to the
start, and a length. The small string optimization for `Substring` would take
advantage
of
the larger size, probably with a less compressed encoding for speed.

The downside of having two types is the inconvenience of sometimes having
a
`Substring` when you need a `String`, and vice-versa. It is likely this
would
be a significantly bigger problem than with `Array` and `ArraySlice`, as
slicing of `String` is such a common operation. It is especially relevant
to
existing code that assumes `String` is the currency type. To ease the pain
of
type mismatches, `Substring` should be a subtype of `String` in the same
way
that `Int` is a subtype of `Optional<Int>`. This would give users an
implicit
conversion from `Substring` to `String`, as well as the usual implicit
conversions such as `[Substring]` to `[String]` that other subtype
relationships receive.

In most cases, type inference combined with the subtype relationship
should
make the type difference a non-issue and users will not care which type
they
are using. For flexibility and optimizability, most operations from the
standard library will traffic in generic models of
[`Unicode`](#the--code-unicode--code--protocol).

##### Guidance for API Designers

In this model, **if a user is unsure about which type to use, `String` is
always
a reasonable default**. A `Substring` passed where `String` is expected
will be
implicitly copied. When compared to the “same type, copied storage” model,
we
have effectively deferred the cost of copying from the point where a
substring
is created until it must be converted to `String` for use with an API.

A user who needs to optimize away copies altogether should use this
guideline:
if for performance reasons you are tempted to add a `Range` argument to
your
method as well as a `String` to avoid unnecessary copies, you should
instead
use `Substring`.
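
Applied to an API, the guideline reads like this (hypothetical function names,
sketched under the proposed model):

```swift
// Tempting, but avoid: threading a range alongside the string.
func firstVowel(in s: String, within range: Range<String.Index>) -> Character? { ... }

// Preferred: accept a Substring and let callers slice.
func firstVowel(in s: Substring) -> Character? { ... }

firstVowel(in: name[someRange]) // no copy; the slice shares name's storage
```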

##### The “Empty Subscript”

To make it easy to call such an optimized API when you only have a
`String` (or
to call any API that takes a `Collection`'s `SubSequence` when all you
have is
the `Collection`), we propose the following “empty subscript” operation,

```swift
extension Collection {
  subscript() -> SubSequence {
    return self[startIndex..<endIndex]
  }
}
```

which allows the following usage:

```swift
funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring
```

The `[]` syntax can be offered as a fixit when needed, similar to `&` for
an
`inout` argument. While it doesn't help a user to convert `[String]` to
`[Substring]`, the need for such conversions is extremely rare, and can be
done with a simple `map` (which could also be offered by a fixit):

```swift
takesAnArrayOfSubstring(arrayOfString.map { $0[] })
```

#### Other Options Considered

As we have seen, all three options above have downsides, but it's possible
these downsides could be eliminated/mitigated by the compiler. We are
proposing
one such mitigation—implicit conversion—as part of the "different
type,
shared storage" option, to help avoid the cognitive load on developers of
having to deal with a separate `Substring` type.

To avoid the memory leak issues of a "same type, shared storage" substring
option, we considered whether the compiler could perform an implicit copy
of
the underlying storage when it detects the string is being "stored" for
long
term usage, say when it is assigned to a stored property. The trouble with
this
approach is it is very difficult for the compiler to distinguish between
long-term storage versus short-term in the case of abstractions that rely
on
stored properties. For example, should the storing of a substring inside
an
`Optional` be considered long-term? Or the storing of multiple substrings
inside an array? The latter would not work well in the case of a
`components(separatedBy:)` implementation that intended to return an array
of
substrings. It would also be difficult to distinguish intentional
medium-term
storage of substrings, say by a lexer. There does not appear to be an
effective
consistent rule that could be applied in the general case for detecting
when a
substring is truly being stored long-term.

To avoid the cost of copying substrings under "same type, copied storage",
the
optimizer could be enhanced to reduce the impact of some of those
copies.
For example, this code could be optimized to pull the invariant substring
out
of the loop:

```swift
for _ in 0..<lots {
  someFunc(takingString: bigString[bigRange])
}
```

It's worth noting that a similar optimization is needed to avoid an
equivalent
problem with implicit conversion in the "different type, shared storage"
case:

```swift
let substring = bigString[bigRange]
for _ in 0..<lots { someFunc(takingString: substring) }
```

However, in the case of "same type, copied storage" there are many use
cases
that cannot be optimized as easily. Consider the following simple
definition of
a recursive `contains` algorithm, which, when substring slicing is linear,
makes the overall algorithm quadratic:

```swift
extension String {
    func containsChar(_ x: Character) -> Bool {
        return !isEmpty && (first == x || dropFirst().containsChar(x))
    }
}
```

For the optimizer to eliminate this problem is unrealistic, forcing the
user to
remember to optimize the code to not use string slicing if they want it to
be
efficient (assuming they remember):

```swift
extension String {
    // add optional argument tracking progress through the string
    func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
        let idx = idx ?? startIndex
        return idx != endIndex
            && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
    }
}
```

#### Substrings, Ranges and Objective-C Interop

The pattern of passing a string/range pair is common in several
Objective-C
APIs, and is made especially awkward in Swift by the
non-interchangeability of
`Range<String.Index>` and `NSRange`.

```swift
s2.find(s1, sourceRange: NSRange(j..<s2.endIndex, in: s2))
```

In general, however, the Swift idiom for operating on a sub-range of a
`Collection` is to *slice* the collection and operate on that:

```swift
s2[j..<s2.endIndex].find(s1)
```

Therefore, APIs that operate on an `NSString`/`NSRange` pair should be
imported
without the `NSRange` argument. The Objective-C importer should be
changed to
give these APIs special treatment so that when a `Substring` is passed,
instead
of being converted to a `String`, the full `NSString` and range are passed
to
the Objective-C method, thereby avoiding a copy.

As a result, you would never need to pass an `NSRange` to these APIs,
which
solves the impedance problem by eliminating the argument, resulting in
more
idiomatic Swift code while retaining the performance benefit. To help
users
manually handle any cases that remain, Foundation should be augmented to
allow
the following syntax for converting to and from `NSRange`:

```swift
let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
let iToJ = Range(nsr, in: s)    // Equivalent to i..<j
```

### The `Unicode` protocol

With `Substring` and `String` being distinct types and sharing almost all
interface and semantics, and with the highest-performance string
processing
requiring knowledge of encoding and layout that the currency types can't
provide, it becomes important to capture the common “string API” in a
protocol.
Since Unicode conformance is a key feature of string processing in Swift,
we
call that protocol `Unicode`:

**Note:** The following assumes several features that are planned but not
yet implemented in
  Swift, and should be considered a sketch rather than a final design.

```swift
protocol Unicode
  : Comparable, BidirectionalCollection where Element == Character {

  associatedtype Encoding : UnicodeEncoding
  var encoding: Encoding { get }

  associatedtype CodeUnits
    : RandomAccessCollection where Element == Encoding.CodeUnit
  var codeUnits: CodeUnits { get }

  associatedtype UnicodeScalars
    : BidirectionalCollection where Element == UnicodeScalar
  var unicodeScalars: UnicodeScalars { get }

  associatedtype ExtendedASCII
    : BidirectionalCollection where Element == UInt32
  var extendedASCII: ExtendedASCII { get }
}

extension Unicode {
  // ... define high-level non-mutating string operations, e.g. search ...

  func compared<Other: Unicode>(
    to rhs: Other,
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> SortOrder { ... }
}

extension Unicode : RangeReplaceableCollection where CodeUnits :
  RangeReplaceableCollection {
  // Satisfy protocol requirement
  mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
    where C.Element == Element

  // ... define high-level mutating string operations, e.g. replace ...
}
```

The goal is that `Unicode` exposes the underlying encoding and code units
in
such a way that for types with a known representation (e.g. a
high-performance
`UTF8String`) that information can be known at compile-time and can be
used to
generate a single path, while still allowing types like `String` that
admit
multiple representations to use runtime queries and branches to fast path
specializations.
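
As an illustration of the intent, a fixed-encoding type might conform like
this (entirely hypothetical):

```swift
// Sketch: storage known to be UTF-8 at compile time, so generic
// algorithms over Unicode can specialize without runtime branching.
struct UTF8String : Unicode {
  typealias Encoding = UTF8
  var encoding: UTF8 { return UTF8() }

  var codeUnits: [UTF8.CodeUnit] // contiguous UTF-8 bytes
  // ... unicodeScalars, extendedASCII, and Collection requirements
  //     derived from codeUnits ...
}
```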

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.

### Scanning, Matching, and Tokenization

#### Low-Level Textual Analysis

We should provide convenient APIs for processing strings by character. For
example,
it should be easy to cleanly express, “if this string starts with `"f"`,
process
the rest of the string as follows…” Swift is well-suited to expressing
this
common pattern beautifully, but we need to add the APIs. Here are two
examples
of the sort of code that might be possible given such APIs:

```swift
if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
  somethingWith(input) // process the rest of input
}

if let (number, restOfInput) = input.parsingPrefix(Int.self) {
   ...
}
```

The specific spelling and functionality of APIs like this are TBD. The
larger
point is to make sure matching-and-consuming jobs are well-supported.

#### Unified Pattern Matcher Protocol

Many of the current methods that do matching are overloaded to do the same
logical operations in different ways, with the following axes:

- Logical Operation: `find`, `split`, `replace`, match at start
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
  the method name, and sometimes an argument
- Whole string or subrange.

We should represent these aspects as orthogonal, composable components,
abstracting pattern matchers into a protocol like
[this one](
https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33
),
that can allow us to define logical operations once, without introducing
overloads, and massively reducing API surface area.

For example, using the strawman prefix `%` syntax to turn string literals
into
patterns, the following pairs would all invoke the same generic methods:

```swift
if let found = s.firstMatch(%"searchString") { ... }
if let found = s.firstMatch(someRegex) { ... }

for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
for m in s.allMatches(someRegex) { ... }

let items = s.split(separatedBy: ", ")
let tokens = s.split(separatedBy: CharacterSet.whitespace)
```

Note that, because Swift requires the indices of a slice to match the
indices of
the range from which it was sliced, operations like `firstMatch` can
return a
`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match
in
the string being searched, if needed, can easily be recovered as the
`startIndex` and `endIndex` of the `Substring`.
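
Recovering the range from such a match is then trivial (a sketch, reusing the
strawman `%` syntax; `firstMatch` is hypothetical and `s` is assumed mutable):

```swift
if let match = s.firstMatch(%"needle") {   // match is a Substring
  // The slice shares s's indices, so the range falls out directly:
  let range = match.startIndex..<match.endIndex
  s.replaceSubrange(range, with: "thread") // valid anywhere s expects its own indices
}
```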

Note also that matching operations are useful for collections in general,
and
would fall out of this proposal:

```swift
// replace subsequences of contiguous NaNs with zero
forces.replace(oneOrMore([Float.nan]), [0.0])
```

#### Regular Expressions

Addressing regular expressions is out of scope for this proposal.
That said, it is important to note that the pattern matching protocol
mentioned
above provides a suitable foundation for regular expressions, and types
such as
`NSRegularExpression` can easily be retrofitted to conform to it. In the
future, support for regular expression literals in the compiler could
allow for
compile-time syntax checking and optimization.

### String Indices

`String` currently has four views—`characters`, `unicodeScalars`, `utf8`,
and
`utf16`—each with its own opaque index type. The APIs used to translate
indices
between views add needless complexity, and the opacity of indices makes
them
difficult to serialize.

The index translation problem has two aspects:

  1. `String` views cannot consume one another's indices without a
cumbersome
    conversion step. An index into a `String`'s `characters` must be
translated
    before it can be used as a position in its `unicodeScalars`. Although
these
    translations are rarely needed, they add conceptual and API
complexity.
  2. Many APIs in the core libraries and other frameworks still expose
`String`
    positions as `Int`s and regions as `NSRange`s, which can only
reference a
    `utf16` view and interoperate poorly with `String` itself.

#### Index Interchange Among Views

String's need for flexible backing storage and reasonably-efficient
indexing
(i.e. without dynamically allocating and reference-counting the indices
themselves) means indices need an efficient underlying storage type.
Although
we do not wish to expose `String`'s indices *as* integers, `Int` offsets
into
underlying code unit storage makes a good underlying storage type,
provided
`String`'s underlying storage supports random-access. We think
random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Making these `Int` code unit offsets conveniently accessible and
constructible
solves the serialization problem:

```swift
clipboard.write(s.endIndex.codeUnitOffset)
let offset = clipboard.read(Int.self)
let i = String.Index(codeUnitOffset: offset)
```

Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
seamless by having them share an index type (semantics of indexing a
`String`
between grapheme cluster boundaries are TBD—it can either trap or be
forgiving).
Having a common index allows easy traversal into the interior of
graphemes,
something that is often needed, without making it likely that someone will
do it
by accident.

- `String.index(after:)` should advance to the next grapheme, even when
the
   index points partway through a grapheme.

- `String.index(before:)` should move to the start of the grapheme before
   the current position.

Seamless index interchange between `String` and its UTF-8 or UTF-16 views
is not
crucial, as the specifics of encoding should not be a concern for most use
cases, and would impose needless costs on the indices of other views. That
said, we can make translation much more straightforward by exposing simple
bidirectional converting `init`s on both index types:

```swift
let u8Position = String.UTF8.Index(someStringIndex)
let originalPosition = String.Index(u8Position)
```

#### Index Interchange with Cocoa

We intend to address `NSRange`s that denote substrings in Cocoa APIs as
described [later in this
document](#substrings--ranges-and-objective-c-interop).
That leaves the interchange of bare indices with Cocoa APIs trafficking in
`Int`. Hopefully such APIs will be rare, but when needed, the following
extension, which would be useful for all `Collections`, can help:

```swift
extension Collection {
  func index(offset: IndexDistance) -> Index {
    return index(startIndex, offsetBy: offset)
  }
  func offset(of i: Index) -> IndexDistance {
    return distance(from: startIndex, to: i)
  }
}
```

Then integers can easily be translated into offsets into a `String`'s
`utf16`
view for consumption by Cocoa:

```swift
let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
let swiftIndex = s.utf16.index(offset: cocoaIndex)
```

### Formatting

A full treatment of formatting is out of scope of this proposal, but
we believe it's crucial for completing the text processing picture. This
section details some of the existing issues and thinking that may guide
future
development.

#### Printf-Style Formatting

`String.format` is designed on the `printf` model: it takes a format
string with
textual placeholders for substitution, and an arbitrary list of other
arguments.
The syntax and meaning of these placeholders has a long history in
C, but for anyone who doesn't use them regularly they are cryptic and
complex,
as the `printf(3)` man page attests.

Aside from complexity, this style of API has two major problems: First,
the
spelling of these placeholders must match up to the types of the
arguments, in
the right order, or the behavior is undefined. Some limited support for
compile-time checking of this correspondence could be implemented, but
only for
the cases where the format string is a literal. Second, there's no
reasonable
way to extend the formatting vocabulary to cover the needs of new types:
you are
stuck with what's in the box.

#### Foundation Formatters

The formatters supplied by Foundation are highly capable and versatile,
offering
both formatting and parsing services. When used for formatting, though,
the
design pattern demands more from users than it should:

  * Matching the type of data being formatted to a formatter type
  * Creating an instance of that type
  * Setting stateful options (`currency`, `dateStyle`) on the type. Note:
the
    need for this step prevents the instance from being used and discarded
in
    the same expression where it is created.
  * Overall, introduction of needless verbosity into source

These may seem like small issues, but the experience of Apple localization
experts is that the total drag of these factors on programmers is such
that they
tend to reach for `String.format` instead.

#### String Interpolation

Swift string interpolation provides a user-friendly alternative to printf's
domain-specific language (just write ordinary Swift code!) and to its type
safety problems (put the data right where it belongs!), but the following
issues prevent it from being useful for localized formatting (among other
jobs):

  * [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
    the types used in string interpolation.
  * [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
    distinguish (fragments of) the base string from the string substitutions.

In the long run, we should improve Swift string interpolation to the point
where it can participate in almost any formatting job. Mostly this centers
around fixing the interpolation protocols per the previous item, and
supporting localization.

To be able to use formatting effectively inside interpolations, it needs to be
both lightweight (because it all happens in situ) and discoverable. One
approach would be to standardize on `format` methods, e.g.:

```swift
"Column 1: \(n.format(radix: 16, width: 8)) *** \(message)"

"Something with leading zeroes: \(x.format(fill: zero, width: 8))"
```
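No such `format` methods ship today; a hypothetical sketch of one, built on the
existing `String(_:radix:)` initializer, shows how lightweight they could be.
The parameter names `radix`, `width`, and `fill` mirror the examples above and
are not proposed API:

```swift
// Hypothetical sketch only: an integer `format` method in the spirit of the
// interpolation examples in this section.
extension FixedWidthInteger {
  func format(radix: Int = 10, width: Int = 0, fill: Character = " ") -> String {
    let digits = String(self, radix: radix)        // existing stdlib initializer
    let pad = max(0, width - digits.count)         // how many fill characters we need
    return String(repeating: fill, count: pad) + digits
  }
}

let line = "Column 1: \(255.format(radix: 16, width: 8))"
```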

### C String Interop

Our support for interoperation with nul-terminated C strings is scattered and
incoherent, with six ways to transform a C string into a `String` and four
ways to do the inverse. These APIs should be replaced with the following:

```swift
extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  ///   bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  ///   the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}
```

In both of the construction APIs, any invalid encoding sequence detected will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per the Unicode specification. This covers the common case. The
replacement is done *physically* in the underlying storage, and the validity
of the result is recorded in the `String`'s `encoding` so that future accesses
need not be slowed down by separate error repair.

Construction that aborts when encoding errors are detected can be accomplished
using APIs on the `encoding`. String types that retain their physical encoding
even in the presence of errors and are repaired on the fly can be built as
different instances of the `Unicode` protocol.
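The repair behavior described here already exists in today's
`String(cString:)`, whose documentation specifies that invalid UTF-8 is
replaced with U+FFFD. A small demonstration:

```swift
// "A", a lone 0xC3 lead byte (an invalid, truncated 2-byte sequence),
// then "(", "B", and the terminating NUL.
let bytes: [UInt8] = [0x41, 0xC3, 0x28, 0x42, 0x00]

let repaired = bytes.withUnsafeBufferPointer { buffer in
  String(cString: buffer.baseAddress!)  // invalid bytes become U+FFFD
}
// repaired starts with "A" and contains the replacement character.
```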

### Unicode 9 Conformance

Unicode 9 (and MacOS 10.11) brought us support for family emoji, which changes
the process of properly identifying `Character` boundaries. We need to update
`String` to account for this change.
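On a toolchain with Unicode 9 grapheme-cluster rules (the goal of this item),
a family emoji is a single `Character` even though it comprises several
Unicode scalars:

```swift
// Family emoji: four person emoji joined by three ZERO WIDTH JOINERs.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

let characterCount = family.count              // 1 under Unicode 9 rules
let scalarCount = family.unicodeScalars.count  // 7: four emoji + three ZWJs
```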

### High-Performance String Processing

Many strings are short enough to store in 64 bits; many can be stored using
only 8 bits per Unicode scalar; others are best encoded in UTF-16; and some
come to us already in some other encoding, such as UTF-8, that would be costly
to translate. Supporting these formats while maintaining usability for
general-purpose APIs demands that a single `String` type can be backed by many
different representations.

That said, the highest-performance code always requires static knowledge of
the data structures on which it operates, and for this code, dynamic selection
of representation comes at too high a cost. Heavy-duty text processing demands
a way to opt out of dynamism and directly use known encodings. Having this
ability can also make it easy to cleanly specialize code that handles dynamic
cases for maximal efficiency on the most common representations.

To address this need, we can build models of the `Unicode` protocol that
encode representation information into the type, such as
`NFCNormalizedUTF16String`.

### Parsing ASCII Structure

Although many machine-readable formats support the inclusion of arbitrary
Unicode text, it is also common that their fundamental structure lies entirely
within the ASCII subset (JSON, YAML, many XML formats). These formats are
often processed most efficiently by recognizing ASCII structural elements as
ASCII, and capturing the arbitrary sections between them in more-general
strings. The current `String` API offers no way to efficiently recognize ASCII
and skip past everything else without the overhead of full decoding into
Unicode scalars.

For these purposes, strings should supply an `extendedASCII` view that is a
collection of `UInt32`, where values less than `0x80` represent the
corresponding ASCII character, and other values represent data that is
specific to the underlying encoding of the string.
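The proposed `extendedASCII` view doesn't exist yet, but today's `utf8` view
approximates the idea, since UTF-8 bytes below `0x80` are exactly ASCII. A
sketch that locates the first `:` in a JSON-ish fragment without decoding any
scalars:

```swift
// Find the byte offset of the first ASCII colon, treating everything else
// as opaque. Non-ASCII content ("ä", "ü") is skipped without decoding.
let fragment = "\"näme\": \"valüe\""

var colonOffset: Int? = nil
for (offset, byte) in fragment.utf8.enumerated() where byte == UInt8(ascii: ":") {
  colonOffset = offset
  break
}
```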

## Language Support

This proposal depends on two new features in the Swift language:

1. **Generic subscripts**, to
   enable unified slicing syntax.

2. **A subtype relationship** between
   `Substring` and `String`, enabling framework APIs to traffic solely in
   `String` while still making it possible to avoid copies by handling
   `Substring`s where necessary.

Additionally, **the ability to nest types and protocols inside
protocols** could significantly shrink the footprint of this proposal
on the top-level Swift namespace.

## Open Questions

### Must `String` be limited to storing UTF-16 subset encodings?

- The ability to handle UTF-8-encoded strings (models of `Unicode`) is not in
  question here; this is about what encodings must be storable, without
  transcoding, in the common currency type called “`String`”.
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
- If we have a way to get at a `String`'s code units, we need a concrete type
  in which to express them in the API of `String`, which is a concrete type.
- If `String` needs to be able to represent UTF-32, presumably the code units
  need to be `UInt32`.
- Not supporting UTF-32-encoded text seems like one reasonable design choice.
- Maybe we can allow UTF-8 storage in `String` and expose its code units as
  `UInt16`, just as we would for Latin-1.
- Supporting only UTF-16-subset encodings would imply that `String` indices
  can be serialized without recording the `String`'s underlying encoding.
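The "UTF-16 subset" claim for ASCII and Latin-1 can be spot-checked: every
scalar in `0x00...0xFF` occupies exactly one UTF-16 code unit, so such strings
can be stored and exposed as `UInt16` code units without transcoding:

```swift
// Spot-check: each Latin-1 scalar (which includes all of ASCII) encodes as a
// single UTF-16 code unit.
let allSingleUnit = (0x00...0xFF).allSatisfy { value in
  String(Character(UnicodeScalar(UInt32(value))!)).utf16.count == 1
}
```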

### Do we need a type-erasable base protocol for UnicodeEncoding?

`UnicodeEncoding` has an associated type, but it may be important to be able
to traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”
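One common Swift answer is to split off a base protocol with no associated
type, so encoding values can be handled fully dynamically. The sketch below is
purely illustrative; the protocol and type names (`EncodingIdentity`,
`CodeUnitEncoding`, `MyUTF8`) are hypothetical, not proposed API:

```swift
// Hypothetical sketch of a type-erasable base beneath an associated-type
// protocol.
protocol EncodingIdentity {
  static var name: String { get }  // queries that need no CodeUnit type
}

protocol CodeUnitEncoding: EncodingIdentity {
  associatedtype CodeUnit: UnsignedInteger
}

enum MyUTF8: CodeUnitEncoding {
  typealias CodeUnit = UInt8
  static var name: String { return "UTF-8" }
}

// The base protocol can be used as a dynamic value, e.g. as the answer to
// "what is the most efficient encoding for this string?"
let dynamicEncoding: EncodingIdentity.Type = MyUTF8.self
```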

### Should there be a string “facade?”

One possible design alternative makes `Unicode` a vehicle for expressing the
storage and encoding of code units, but does not attempt to give it an API
appropriate for `String`. Instead, string APIs would be provided by a generic
wrapper around an instance of `Unicode`:

```swift
struct StringFacade<U: Unicode> : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias String = StringFacade<StringStorage>
typealias Substring = StringFacade<StringStorage.SubSequence>
```

This design would allow us to de-emphasize lower-level `String` APIs, such as
access to the specific encoding, by putting them behind a `.unicode` property.
A similar effect in a facade-less design would require a new top-level
`StringProtocol` playing the role of the facade, with an `associatedtype
Storage : Unicode`.

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

```swift
struct String<U: Unicode = StringStorage>
  : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>
```

One advantage of such a design is that naïve users will always extend “the
right type” (`String`) without thinking, and the new APIs will show up on
`Substring`, `MyUTF8String`, etc. That said, it also has downsides that should
not be overlooked, not least of which is the confusability of the meaning of
the word “string.” Is it referring to the generic or the concrete type?

### `TextOutputStream` and `TextOutputStreamable`

`TextOutputStreamable` is intended to provide a vehicle for
efficiently transporting formatted representations to an output stream
without forcing the allocation of storage. Its use of `String`, a
type with multiple representations, at the lowest-level unit of
communication, conflicts with this goal. It might be sufficient to
change `TextOutputStream` and `TextOutputStreamable` to traffic in an
associated type conforming to `Unicode`, but that is not yet clear.
This area will require some design work.
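The coupling to `String` is visible in the existing protocol requirement: a
custom `TextOutputStream` receives every fragment as a fully formed `String`,
regardless of the underlying representation. A minimal conformance:

```swift
// A custom output stream: every fragment arrives through write(_:) as a
// String, the multi-representation type this section is concerned about.
struct CountingStream: TextOutputStream {
  var utf16Length = 0
  mutating func write(_ string: String) {
    utf16Length += string.utf16.count
  }
}

var stream = CountingStream()
print("héllo", terminator: "", to: &stream)  // drives write(_:) with String fragments
```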

### `description` and `debugDescription`

* Should these be creating localized or non-localized representations?
* Is returning a `String` efficient enough?
* Is `debugDescription` pulling the weight of the API surface area it adds?

### `StaticString`

`StaticString` was added as a byproduct of standard library development and
kept around because it seemed useful, but it was never truly *designed* for
client programmers. We need to decide what happens with it. Presumably
*something* should fill its role, and that should conform to `Unicode`.

## Footnotes

<b id="f0">0</b> The integers rewrite currently underway is expected to
    substantially reduce the scope of `Int`'s API by using more
    generics. [:leftwards_arrow_with_hook:](#a0)

<b id="f1">1</b> In practice, these semantics will usually be tied to the
version of the installed [ICU](http://icu-project.org) library, which
programmatically encodes the most complex rules of the Unicode Standard and
its de-facto extension, CLDR. [:leftwards_arrow_with_hook:](#a1)

<b id="f2">2</b> See
[http://unicode.org/reports/tr29/#Notation](http://unicode.org/reports/tr29/#Notation).
Note that inserting Unicode scalar values to prevent merging of grapheme
clusters would also constitute a kind of misbehavior (one of the clusters at
the boundary would not be found in the result), so it would be relatively
costly to implement, with little benefit. [:leftwards_arrow_with_hook:](#a2)

<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by
  the Unicode standard for this purpose. In fact there's
  a [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
  dedicated to it. In particular, §5.17 says:

  > When comparing text that is visible to end users, a correct linguistic
  > sort should be used, as described in _Section 5.16, Sorting and
  > Searching_. However, in many circumstances the only requirement is for a
  > fast, well-defined ordering. In such cases, a binary ordering can be used.

  [:leftwards_arrow_with_hook:](#a4)

<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto
properties in a table that's indexed by unicode scalar value. This table is
part of the Unicode standard. Some of these queries (e.g., “is this an
uppercase character?”) may have fairly obvious generalizations to grapheme
clusters, but exactly how to do it is a research topic and *ideally* we'd
either establish the existing practice that the Unicode committee would
standardize, or the Unicode committee would do the research and we'd implement
their result. [:leftwards_arrow_with_hook:](#a5)

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Ben Rimmington) #20

<https://github.com/apple/swift/blob/master/docs/StringManifesto.md#f1>

In practice, these semantics will usually be tied to the version of the installed ICU library, which programmatically encodes the most complex rules of the Unicode Standard and its de-facto extension, CLDR.

Unicode is released in June. <http://www.unicode.org/versions/#schedule>
CLDR is released in March and September. <http://cldr.unicode.org/#TOC-General-Schedule->
ICU is released in April and October. <http://site.icu-project.org/download>

Therefore "libicucore" on Apple platforms will always use an older Unicode standard.
For example, iOS 10 and macOS Sierra are using ICU 57, which doesn't support Unicode 9.

Could you include the latest ICU alongside the Swift standard library?

-- Ben