State of String: ABI, Performance, Ergonomics, and You!

(A gist-formatted version of this email can be found at https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f)

# State of String: ABI, Performance, Ergonomics, and You!

Hello, I’ve been working on implementing, optimizing, and improving String in preparation for ABI stability, and I thought I’d share the current status of String in Swift 5 and some potential directions to go. This is the product of conversations with open source contributors and my colleagues on the Swift standard library team at Apple.

The primary area of focus is stabilizing String’s ABI, but we’d be remiss if we couldn’t also fit in performance and ergonomic improvements. String’s ergonomics in particular is one area where we think the status quo is woefully inadequate, and over half of this email is devoted to that topic. At the end, there’s a section about a community initiative that we hope can help users of String as well as guide future development.

(Note: I’m sending this to swift-dev because much of the contents revolve around implementation concerns. I’ll also cross-reference on swift-evolution and swift-users. See also the [StringManifesto](https://github.com/apple/swift/blob/master/docs/StringManifesto.md) from Swift 4 era, which elaborates more on the philosophy and justification of many of the things discussed here.)

## String ABI

String’s ABI consists of its in-memory layout, any methods that are public or callable from inlineable methods, and any interpretation of its bit pattern by inlineable methods. The challenge in designing String’s ABI is determining what avenues are worth keeping open to pursue in the future and what should be closed off for good to ensure performance.

In a non-ABI-stable world, the Swift optimizer is able to peer deeply into the bowels of String’s implementation to perform analyses and optimizations. But, ABI stability introduces resilience barriers through which the optimizer cannot peer. Keeping reasonable performance while stabilizing an ABI requires careful crafting of the String type as well as reevaluating many implementation choices. This area is critical to String’s long term health, even though it has little user-visibility in the short term.

For example, as an implementation detail, String uses UTF-16 as its underlying encoding. For values that fit within the first 7-bit subset of UTF-16 (that is, ASCII), String uses 1-byte backing storage to save space. UTF-16 provides the best interop story for application development and interacting with ICU. But, other domains such as scripting, systems programming, and server-side development strongly prefer UTF-8 and transcoding may not serve their performance needs. If this implementation detail can remain behind a resilience barrier, that is if the performance cost is not too high, then String could evolve in the future to natively support backing UTF-8 storage.

The details are still in flux, and likely not of interest to most readers. Much of the re-gutting work is happening on the appropriately-named [StringGuts branch](https://github.com/apple/swift/pull/12442).

## String Performance

Strings are pervasive in any system and their performance matters. We’re planning on two major efforts to improve performance this release: comparison improvements and small-string optimizations. Additionally, internal to the standard library, we’re introducing and using unmanaged strings and some performance flags, which may be worth surfacing as API for highly-performance-sensitive uses.

### String Comparison

**Strings should compare consistently**, with all the nice properties we know and love such as transitivity. String honors Unicode canonical equivalence, i.e. comparison is modulo-normalization. This means that ự (U+1EF1: LATIN SMALL LETTER U WITH HORN AND DOT BELOW) compares equivalent to ư ◌̣ (U+01B0 U+0323: LATIN SMALL LETTER U WITH HORN, COMBINING DOT BELOW) and u ◌̣ ◌̛ (U+0075 U+0323 U+031B: LATIN SMALL LETTER U, COMBINING DOT BELOW, COMBINING HORN), and all the other wonderful wacky permutations thereof.
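In code, that looks like this (using the scalars named above):

```swift
let precomposed = "\u{1EF1}"                      // ự as a single scalar
let combining = "\u{01B0}\u{0323}"                // ư + combining dot below
let fullyDecomposed = "\u{0075}\u{0323}\u{031B}"  // u + combining dot below + combining horn

// All three spellings are canonically equivalent, so String treats them as
// equal, and each one is a single Character (grapheme cluster).
print(precomposed == combining)        // true
print(precomposed == fullyDecomposed)  // true
print(precomposed.count)               // 1
```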

**String’s ordering is not sufficient for human presentation**, as there is no ordering that’s appropriate for all languages or uses. For example, `ö` comes before `z` in German, but not in Swedish. String’s comparison operators (`==` and `<`) do not attempt to provide locale-specific orderings, but instead a *universal* ordering, that is one which is useful for maintaining programmer invariants such as the sorted-ness of a data structure. String’s choice to honor Unicode canonical equivalence strikes a balance between universality and efficiency while helping most programmers avoid common pitfalls in dealing with Unicode.

(Quick note on localization: Localization is not within the purview of the standard library as it’s better provided by the platform, i.e. Foundation. If you’re presenting an ordering to a user, use a locale-aware method from Foundation such as `localizedStandardCompare()`.)
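For example, a user-visible sort should go through Foundation rather than `<`:

```swift
import Foundation

let words = ["Zürich", "Zebra", "apple", "Äpfel"]

// Universal ordering: good for maintaining programmer invariants.
let machineOrder = words.sorted()

// Locale-aware ordering: what you would actually present to a user.
let presentationOrder = words.sorted {
    $0.localizedStandardCompare($1) == .orderedAscending
}
```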

Given that ordering is not fit for human consumption, but rather machine processing, it might as well be fast. The current ordering differs on Darwin and Linux platforms, with Linux in particular suffering from poor performance due to choice of ordering (UCA with DUCET) and older versions of ICU. Instead, [String Comparison Prototype](https://github.com/apple/swift/pull/12115) provides a simpler ordering that allows for many common-case fast-paths and optimizations. For all the Unicode enthusiasts out there, this is the lexicographical ordering of NFC-normalized UTF-16 code units.

### Small String Optimization

Many strings are small. It is wasteful and inefficient to store them on the heap and track their lifetimes when the String struct could store them inline, i.e. encoded directly in the struct’s bit pattern. Substrings, if they cover a small span of a potentially-large String, can avoid retaining ownership of the storage and instead encode their contents inline with an offset to map indices.

NSString has leveraged this kind of optimization on 64 bit platforms to great effect through tagged-pointer representations. Swift’s String has the advantage of being a struct and, depending on the final ABI, having more bits and flags available. We will probably have multiple small string representations.

For example, assuming a 16-byte String struct and 8 bits used for flags and discriminators (including discriminators for which small form a String is in), 120 bits are available for a small string payload. 120 bits can hold 7 UTF-16 code units, which is sufficient for most graphemes and many common words and separators. 120 bits can also fit 15 ASCII/UTF-8 code units without any packing, which suffices for many programmer/system strings (which have a strong skew towards ASCII).

We may also want a compact 5-bit encoding for formatted numbers, such as 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in base-10, which would require 18, 19, and 24 characters respectively. 120 bits with a 5-bit encoding would fit all of these. This would speed up the creation and interpolation of many strings containing numbers.
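The arithmetic behind those capacities, under the assumed 16-byte layout:

```swift
let payloadBits = 128 - 8  // 16-byte struct minus 8 bits of flags/discriminators
print(payloadBits / 16)    // 7: UTF-16 code units
print(payloadBits / 8)     // 15: ASCII/UTF-8 code units
print(payloadBits / 5)     // 24: characters of a 5-bit encoding, enough for the cases above
```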

Final details are still in exploration. If the construction and interpretation of small bit patterns can remain behind a resilience barrier, new forms could be added in the future. However, the performance impact of this would need to be carefully evaluated.

### Unmanaged Strings

One of the ways that the standard library internally is dealing with the upcoming optimization barriers is by restructuring much of String’s implementation to be on top of an unmanaged form. These are rough analogues to UnsafeBufferPointer, i.e. a pointer and length with flags and string-semantic operations defined on them. Like small strings, these still have type String as they are just a different form encoded in the struct’s bit pattern.

Such unmanaged strings can be used when the lifetime of the storage is known to outlive all uses. As such, they don’t need to participate in reference counting. In the future, perhaps with ownership annotations or sophisticated static analysis, Swift could convert managed strings into unmanaged ones as an optimization when it knows the contents would not escape. Similarly in the future, a String constructed from a Substring that is known to not outlive the Substring’s owner can use an unmanaged form to skip copying the contents. This would benefit String’s status as the recommended common-currency type for API design.

Some kind of unmanaged string construct is an often-requested feature for highly performance-sensitive code. We haven’t thought too deeply about how best to surface this as API and it can be sliced and diced many ways (a strawman sketch of one option follows the list). Some considerations:

* Does it make sense to introduce a new type (or types) here, or is it better to base everything on a sort of `withUnsafeCodeUnits`-like construct that supplies UnsafeBufferPointers?
* How does 1-byte or 2-byte backing storage surface? Some options:
  * Closures receive optional arguments. E.g. `withUnsafeUTF16CodeUnits` passes an `UnsafeBufferPointer<UInt16>?` to its closure. String would also have to be queryable as to the nature of the underlying storage, and those queries also become new API and ABI.
  * The backing storage is abstracted away sufficiently. I.e. the 1-byte or 2-byte decision is tracked by the type provided by e.g. `withUnsafeCodeUnits`. This trades a little performance for simpler usage. This would probably require a new type which in turn is new API and ABI.
  * Operations are given an enum of the different forms. That enum also becomes API as well as ABI.
    * Is the enum open or closed? Can this evolve over time?
  * Note that whether the underlying storage is 1-byte or 2-byte is not necessarily stable version-to-version, e.g. if String utilized UTF-8 storage for mostly-ASCII Strings in the future.
* Many Strings are effectively immutable and immortal, as in, the String’s lifetime will definitely outlive all potential uses (e.g. String literals). Would this work also encompass uses of StaticString? Can they be unified?
* What about a low-level API for efficiently building Strings? Could the user call `reserveCapacity` and then get a buffer pointer to excess capacity, updating count along the way?
* If this is niche and specifically for performance, should it be UnsafeString instead and skip bounds checks in release configurations?
* The APIs need to be designed and named. UnmanagedString’s subscript loads raw code units, but it also provides the functionality to perform grapheme breaking (as String is layered on top of it). How can this be designed in a way that implies semantic symmetry with String?
* How would small strings operate? Should the APIs fail on them, or receive a scratch buffer to spill them into, or be abstracted by the type?
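To make one of the options above concrete, here is a very rough strawman of the enum-of-forms approach. None of these names or APIs exist, and small strings are glossed over entirely:

```swift
// Strawman only: a closed (for now) enum describing the underlying storage,
// handed to a scoped closure so the pointer cannot outlive the String.
enum UnsafeStringContents {
    case oneByte(UnsafeBufferPointer<UInt8>)   // ASCII backing storage
    case twoByte(UnsafeBufferPointer<UInt16>)  // UTF-16 backing storage
}

extension String {
    // Hypothetical entry point; a real design would also have to answer the
    // small-string, mutation, and bridging questions raised above.
    func withUnsafeContents<R>(_ body: (UnsafeStringContents) throws -> R) rethrows -> R {
        fatalError("illustrative sketch only")
    }
}
```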

### Performance Flags

Many operations such as decoding, comparison (under the new model), and grapheme breaking have fast-paths for common situations where the portion of the String under consideration does not exhibit the full complexity of Unicode. We plan on introducing flags denoting whether the *entire* String has these properties, allowing the implementation to just execute these fast-paths without repeatedly querying the underlying code units and with better instruction locality. Unsetting a flag will always preserve semantics, just slightly regress performance (the fast-paths will remain). Some flags include:

* **isTriviallyEncoded**: If all scalars are comprised of a single code unit, then decoding is trivial: just zero-extend.
* **isCanonical**: If a String is already in our preferred canonical form for comparison purposes, then honoring canonical equivalence is trivial.
* **isSingleScalarGrapheme**: If all graphemes are composed of a single scalar, then grapheme breaking is trivial.

String is a struct and these flags are stored inline, so they can only be set upon creation or inside a mutable method. In general, it’s not worth expensive checks to set these flags or onerous bookkeeping to maintain them, but a best-effort can be made when it’s cheap to do so. Since unsetting a flag is semantics-preserving, flags can be reserved for future use by always setting them to 0 in the current version. In this way, flags can be added in a backwards-compatible way, where older versions ignore them.

For highly performance-sensitive code, it might be worth having a mutating method to perform a linear scan and update these flags. This should be a very fast check to update inline flags, and exposing it as API should probably also expose getters to query flags. We haven’t thought deeply about how best to surface this for judicious use without creating an easily [mis-applied API](http://perldoc.perl.org/functions/study.html) littered throughout code bases.
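Purely as a shape sketch, and with every name below a placeholder rather than proposed API, such an opt-in scan plus query getters might look like:

```swift
extension String {
    // Hypothetical: a single linear pass over the code units, setting whichever
    // whole-string flags turn out to hold.
    mutating func scanAndUpdatePerformanceFlags() { /* linear scan */ }

    // Hypothetical getters; `false` only means “not known to hold”.
    var isKnownTriviallyEncoded: Bool { return false }
    var isKnownCanonical: Bool { return false }
    var isKnownSingleScalarGrapheme: Bool { return false }
}
```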

### Miscellaneous

There are many other tweaks and improvements possible under the new ABI prior to stabilization, such as:

* Guaranteed nul-termination for String’s storage classes for faster C bridging.
* Unification of String and Substring’s various Views.
* Some form of opaque string, such as an existential, which can be used as an extension point.
* More compact/performant indices.

## String Ergonomics

Ergonomics is an area that’s not been well-served by String. Collection conformance and multi-line literals in Swift 4 were a welcome first step. But, trying to write a script, solve a classic whiteboard interview question, or participate in a coding competition can quickly become an exercise in frustration.

String ergonomics is a long term, multi-release endeavor. It has to continually compete for focus against near-term critical needs such as ABI stability and performance. The below provides some potential focus areas in the Swift 5 timeframe, though we probably won’t be able to focus on them all.

### API Gaps

String’s APIs have a lot of holes in them, and we’re hoping the effort mentioned later in this email can help expose them. Filling these gaps is always in-scope for any release.

#### Character Properties

One immediately obvious gap is Character, which lacks basic queries. For example, it should be possible to write something like `myStr.filter { !$0.isWhitespace }` to get a string with all whitespace removed.

While this can quickly get bogged down in the details (Q: “Does the Mongolian vowel separator count as whitespace?” A: Depends on the version of Unicode!), there should be obvious, well-understood, generally useful queries that Character can answer. This area has been well-trodden by CharacterSet and its static properties can provide guidance on useful queries. Some potential queries would include isWhitespace, isNewline, isPunctuation, isAlphaNumeric, etc. These should probably also be present on UnicodeScalar.

These might be useful on other types such as Substring, where they would apply to the whole slice. E.g., it should be possible to write `myStr.split(separator: ";").filter { $0.isAlphaNumeric }` rather than `myStr.split(separator: ";").filter { !$0.contains { !$0.isAlphaNumeric } }`

Note: In no way would these properties obviate the need for CharacterSet, as CharacterSet API is independently useful and the right model for many whole-string operations.
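None of these properties exist yet; as a stopgap, they can be approximated today by dropping down to Unicode scalars and Foundation’s CharacterSet (the name `isWhitespaceApprox` is ours, and the Unicode-correctness caveats above still apply):

```swift
import Foundation

extension Character {
    // Approximation: every scalar in the grapheme is in the whitespace set.
    var isWhitespaceApprox: Bool {
        return !String(self).unicodeScalars.contains {
            !CharacterSet.whitespacesAndNewlines.contains($0)
        }
    }
}

let stripped = String("a b\tc\nd".filter { !$0.isWhitespaceApprox })
print(stripped) // "abcd"
```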

### Interpolation: Building Strings Up

String interpolation, especially when used in conjunction with other language features such as multi-line string literals, presents a compelling, visual way to construct strings. For example, the following theoretical printout of a benchmark result:

print("""
 Result: \(benchmark.result)
 Execution time:
   user(ms): \(benchmark.userTime.milliseconds)
   system(ms): \(benchmark.systemTime.milliseconds)
   wall(ms): \(benchmark.wallTime.milliseconds)
 """)

String interpolation has many issues and limitations:

* Current interface is inefficient. Individual segments are turned into Strings before interpolation. While small string optimizations alleviate some of this, larger segments incur unneeded heap traffic and tracking.
* While it is [fairly customizable](https://oleb.net/blog/2017/01/fun-with-string-interpolation/), the interface is clunky and deprecated. Brent Royal-Gordon has clearly articulated the problems and provided potential solutions in [Fix ExpressibleByStringInterpolation](https://github.com/brentdax/swift-evolution/blob/230e3ab5fe4a4acaabf52806f69ae3dd1ff9f538/proposals/NNNN-fix-expressible-by-string-interpolation.md). This approach does need to be adjusted to address the performance issue above.
* Interpolated expressions are supplied inside the literal, meaning they cannot be passed around like format strings without extra boilerplate (e.g. a wrapping function).
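For example, the only way today to reuse an interpolation like a format string is to wrap it in a function:

```swift
// Boilerplate wrapper so the benchmark “template” above can be passed around and reused.
func summary(result: String, userMS: Double, systemMS: Double, wallMS: Double) -> String {
    return """
        Result: \(result)
        Execution time:
          user(ms): \(userMS)
          system(ms): \(systemMS)
          wall(ms): \(wallMS)
        """
}
```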

Additionally, some concept of string formatters would greatly aid the ergonomics of constructing strings, whether inside interpolations or not.

### Regex: Tearing Strings Apart

Perhaps the most pressing area in need of ergonomic aid is tearing strings apart to extract information embedded within them. This can take many forms, and we would like to focus the discussion on language-level support for string matching literals, aka “regexes”. (Obligatory note that regexes here may or may not have a loose relationship to regular expressions in formal language theory, depending on details.)

This wouldn’t be worth doing if all it afforded was replacing `"foo"` with `/foo/` inside calls to NSRegularExpression. Regexes become compelling when viewed in conjunction with other language constructs such as switches (i.e. a new kind of [Pattern](https://docs.swift.org/swift-book/ReferenceManual/Patterns.html)), an ability to bind to variables, interpolate other matchers, a rich type system, etc.

Before delving into details on feature set, semantics, and implementation, here’s a quick tour of how they could fit into Swift.

#### One Possible Approach

This is one, brief and hand-wavy, approach to adding regexes to Swift. This is not a proposal, and **all syntax is a total strawman** meant for illustration purposes only. The regex literal syntax below is an adaptation of [Chris Lattner’s strawman syntax](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171120/041619.html). In this syntax, whitespace is insignificant.

###### Compile Regex Literals into `Regex<T>`

let usPhoneNumber = /
 (let area: Int? <- \d{3}?) -
 (let routing: Int <- \d{3}) -
 (let local: Int <- \d{4}) /
print(type(of: usPhoneNumber)) // => Regex<(area: Int?, routing: Int, local: Int)>
...
let (_, routing, local) = try usPhoneNumber.match(myStr)!
print(type(of: routing)) // => Int

If no type is specified, the default is `Substring` covering the match.

###### Types in Captures Conform to Some Protocol

extension Int: RegexSubmatchableiblewobble {
 init?(regexSubmatch: Substring)
}

###### Captures and Interpolation Propagated Through Type System

let countryCode = / \+ (let country: Int <- \d+) /
print(type(of: countryCode)) // => Regex<(country: Int)>

let betterUSPhoneNumber = /
 (let country: Int? <- countryCode?)
 (let (area, routing, local) <- usPhoneNumber) /
print(type(of: betterUSPhoneNumber)) 
 // => Regex<(country: Int?, area: Int?, routing: Int, local: Int)>

let nestedCapture = /
 (let country: Int? <- countryCode?)
 (let number <- usPhoneNumber) /
print(type(of: nestedCapture)) 
 // => Regex<(country: Int?, number: (area: Int?, routing: Int, local: Int))> 

This alone is not sufficient to use regexes in pattern matching, as `~=` returns a bool rather than a tuple describing a match. For that, we need to **generalize pattern matching in Swift**.

###### Pattern Matching Through Conformance to `Pattern`

protocol Pattern {
 associatedtype In
 associatedtype Out
 func match(_: In) -> Out?
}

`~=` can still be checked if needed for compatibility. Alternatively, a default implementation of `match` could be provided in terms of `~=`, where `Out` is `Void`, i.e. only match success or failure. We may want a default implementation of match generic over any collection that `In` is a SubSequence of, to address the `Slice` vs `Collection` or `Substring` vs `String` divide.
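As a toy illustration of the protocol (deliberately unrelated to the regex strawman), here is a hand-written pattern that matches a fixed prefix and produces the remainder:

```swift
struct PrefixPattern: Pattern {
    typealias In = Substring
    typealias Out = Substring

    let prefix: String

    func match(_ input: Substring) -> Substring? {
        guard input.hasPrefix(prefix) else { return nil }
        return input.dropFirst(prefix.count)
    }
}

let line = "swift-dev"
let rest = PrefixPattern(prefix: "swift-").match(line[...]) // Optional("dev")
```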

###### Syntax for Destructuring Matching

Introduce some kind of syntax for a new [value-binding-pattern](https://docs.swift.org/swift-book/ReferenceManual/Patterns.html) that performs destructuring. E.g.

let myValue: T = ...
let pattern: MyTPattern<(U,U)> = ... // matches a T to (U, U)
let otherPattern: OtherTPattern<V> = ... // matches a T to V
switch myValue {
 case (let a: Int, let b: Int) <- pattern: ...
 case let pair <- pattern: ... // pair has type (U,U)
 case let d: Double <- otherPattern: ...
 case let num <- otherPattern: // num has type V
}

which also (hand-wavy timey-wimey) invokes some failable init on `Int` taking a `U`, `Double` taking a `V`, etc., guided by some protocol.

###### Regexes are Just Patterns

struct Regex<T> {
 var program: CompiledRegex<T>
}
extension Regex: Pattern {
 typealias In = Substring
 typealias Out = T
 func match(_ s: In) -> Out? {
   // magic goes here
 }
}

Regex *literals* may have special syntax, but otherwise regexes are not a special case in the language and libraries can take advantage of generalized destructuring matching. Similarly, character classes (which may be user defined, e.g. with a closure) could be Patterns over Characters.

###### Regexes Provide Variants

extension Regex {
 var caseInsensitive: Regex<T> { get }
 var allMatches: Regex<[T]> { get }
 var lineBased: Regex<[T]> { get }
 // ... matches with ranges, maybe even a Match object, etc.
 // ... lazy allMatches variant, etc.
}

###### Final Contrived Example (for Illustrative Purposes)

switch myStringlyTypedNumber {
case let (_, area?, routing, local) <- betterUSPhoneNumber: // ...
case let hexNumStr <- 
	/ 0 x (let num <- [0-9 A-F]+) /.caseInsensitive: // ...
case let (mantissa: Double, exponent: Int) <- scientificNumber: // ...
}

#### Feature Set

The examples above merely illustrate how regex literals could appear, but we need to carefully design the feature set in accordance with Swift’s goals and principles as a language.

One goal of Swift is to eventually [be better at string processing than Perl](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html), so let’s take a look at Perl. Perl’s history is full of interesting insights. Perl organically evolved into the de-facto standard for regex-based string processing culminating in Perl 5. More importantly, they reflected on the [“mess”](https://perl6.org/archive/doc/design/apo/A05.html) and cleaned it up in [Perl 6](https://docs.perl6.org/language/regexes). Perl 6 also has a [PEG](https://dl.acm.org/citation.cfm?id=964001.964011)-like [grammar system](https://docs.perl6.org/language/grammars) and we think similar approaches may be a part of Swift’s long-term future. For now, we’re focusing on regexes, although future considerations do impact the desired feature set.

Swift contrasts with Perl in that Swift places greater importance on clarity and simplicity over raw expressive power and extreme configurability. However, these are some features Swift should outright steal from Perl 6.

* **Free form syntax**. Whitespace is insignificant, comments can occur inside literals, etc., all of which aids readability
* **Named Captures**. If something’s worth capturing, it’s worth giving it a name.
* **Multi-line matching**. A new line is just another character, not a match terminator.
* **Regex interpolation**. Refactor complex regexes into simpler ones (with names!) and interpolate them. This is perhaps the most Swifty aspect of Perl 6 regexes!
* Miscellaneous features common to modern regex systems, such as:
  * Quantifier modifiers, e.g. frugal (“lazy”), ratcheting (“possessive”)
  * Non-capturing grouping, ratcheting (“atomic”) grouping
  * Separator-quantification (“[modified quantifier](https://docs.perl6.org/language/regexes)”)
  * String literals
  * Rich set of character classes (more on that next)

##### Character Classes

We will want a broad array of builtin pre-defined character classes for use in regexes. Additionally, users should be able to construct and define their own, whether from CharacterSets (i.e. enumeration) or through an arbitrary closure of type `(Character) -> Bool`. There should be syntax to use user-defined character classes in a regex literal.

Character classes should provide many set-algebraic operations such as union, intersection, inversion, mutual-exclusion (“xor”), and set difference. (Note that for our purposes, the standard library’s `SetAlgebra` protocol is not appropriate for character classes, as `SetAlgebra` requires conformances to be able to check for subset relationships which is intractable, if not undecidable, for closures over graphemes.)
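A sketch of that model, assuming a user-defined character class is nothing more than a predicate on Character (all names below are illustrative):

```swift
typealias CharacterClass = (Character) -> Bool

func union(_ a: @escaping CharacterClass, _ b: @escaping CharacterClass) -> CharacterClass {
    return { a($0) || b($0) }
}
func intersection(_ a: @escaping CharacterClass, _ b: @escaping CharacterClass) -> CharacterClass {
    return { a($0) && b($0) }
}
func inverse(_ a: @escaping CharacterClass) -> CharacterClass {
    return { !a($0) }
}

let asciiDigit: CharacterClass = { ("0"..."9").contains($0) }
let asciiHexLetter: CharacterClass = { ("a"..."f").contains($0) || ("A"..."F").contains($0) }
let asciiHexDigit = union(asciiDigit, asciiHexLetter)

print(String("0xC0FFEE".filter(asciiHexDigit))) // "0C0FFEE" (the "x" is dropped)
```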

##### Arguable Features

Some literal syntactic features common in other regex systems may be unnecessary or a bad fit for Swift.

1. Substitution literals (e.g. sed-like `s/regex/replacement/g`)
  * If String had better substitution APIs, such as one that took a `Regex<T>` and a closure from `(T) -> String` alongside options such as `.all` or `.firstOnly`, that might be sufficient and/or clearer anyways.
2. Adverbs (e.g. the `i` meaning case-insensitive outside the literal’s delimiters)
  * These could be Regex variants accessible via computed properties, which may be clearer
3. Anonymous (numbered) captures
  * In general, if something’s worth capturing it’s worth giving it a name.
  * A construct like `(let _ = \d+)` could produce an unlabeled tuple element.
  * Tuples (the representation for literal captures) support 0-based access already, e.g. `myCaptures.3`.
4. Enumerated character class ranges for non-ASCII graphemes (the `-` in `[A-Z]`)
  * Convenient for ASCII, but meaningless for graphemes.
  * Good built-in classes and convenient ways for users to combine and enumerate their own may even obviate the need for ranges for ASCII.

##### Forbidden Features (with Alternatives)

There are some regex features that we argue should be disallowed, or at least carefully constrained. Regexes are powerful, specifying simple textual analysis in a concise syntax, but are easily misused in ways that add exponentially-compounding complexity in non-obvious ways. This misuse contrasts with Swift’s values of balancing expressivity with clarity, where complex behavior should be *explicitly expressed in code*.

However, we think there’s so much flexibility in what can be accomplished without the most general form of these features that it might not be an issue in practice.

###### 1. Arbitrarily-Complex Look-Around Assertions

Assertions are the ability for a regex to “peek” ahead or behind at surrounding characters without consuming them as part of a match. They can be positive or negative. Arbitrarily-complex assertions include the ability to do captures as part of an assertion, unknown-length assertions satisfied by subpatterns, etc., which can have non-intuitively complex interactions with other parts of the regex.

**Alternative:** The following constrained assertions could be permissible:

1. Anchors (zero-width assertions), which test for positions within a string, e.g. `^` and `$`.

2. Fixed-length character-class or closure-based assertions, which help avoid common pitfalls in negation (the classic “match a line that doesn’t contain a word” problem) as well as aid readability. This would permit fixed-length arbitrary computation *explicitly expressed in code*. These might also encompass boundaries.

3. Invoking a closure on the string prefix/suffix prior/after the assertion point. This would permit variable-length arbitrary computation, but *explicitly expressed in code*.

###### 2. Arbitrary Uses of Backreferences

Backreferences allow testing a later portion of a string against a previously-matched portion. Unfortunately, in their most general form, they require exploration of the entire search space and [make matching NP-complete](https://perl.plover.com/NPC/NPC-3SAT.html), that is, take exponential time. Very simple uses of backreferences just want to check for an unambiguous capture after the fact, for which a `where` clause suffices. Very complex uses are better served with an exponential loop *explicitly expressed in code*.

**Alternative:** There is a range of convenient uses between these two extremes, which we may be able to tractably address. The feasibility of these is still under investigation, but some potential constrained backreferences may include:

1. A “ratcheting” backreference, where upon matching (i.e. the current section of the string satisfies the backreference), the referred-to capture is “frozen” in place, so that alternative assignments need not be explored.
2. A delimiter-checking construct. While eventually we want to encourage the user to write a real grammar/parser (assuming some future grammar/parser description system in Swift), simple delimiter matching can greatly aid readability. This might even be general and powerful enough to define tasks such as pairing XML tags and ignoring escaped delimiters or delimiters inside literals. Ideally, this would be done in a direction conducive to a future parsing solution.

#### Semantics

Matching semantics should roughly correspond to [UTS #18 Level 2](https://unicode.org/reports/tr18/) of Unicode support. That is, “any character” would be any Swift Character (a grapheme), matching obeys canonical equivalence, etc. This is consistent with String’s overall philosophy of universality and harm reduction. This does require that the implementation use Unicode rules and tables provided at run time, e.g. for grapheme breaking.

Regex interpolation should effectively “inline” the interpolated regex, i.e. [formal concatenation](https://en.wikipedia.org/wiki/Regular_expression). We will probably also want different combination semantics for combining patterns in general (including regexes), possibly in conjunction with a future parsing solution. For example, a form of alternation that is short-circuiting and some form of concatenation that is [“ratcheting”](https://docs.perl6.org/language/regexes), where something like `(myRegex1 || myRegex2) ++ myRegex3` tries myRegex1 first, and if successful myRegex2 is *never* tried regardless of whether myRegex3 succeeds or not. This is similar to the semantics of [PEGs](https://en.wikipedia.org/wiki/Parsing_expression_grammar) or parser-combinators and much more closely reflects programmer intuition for parsing.

We’ll need to nail down a lot of important details:

* How are alternations matched? [LTM](https://perlgeek.de/en/article/longest-token-matching)?
* What are the built in character classes and what do they cover?
  * What Unicode properties to expose?
  * Should there be API symmetry with Character queries mentioned above?
* What are the built in anchors, fixed-length assertions, and can they be customized?
* What is the type of a quantified capture? An array of the capture’s declared type?
* What guarantees about runtime and space complexity do we provide, if any?
* While we may have variants for whole-string, partial-string, line-by-line, etc., what should the defaults be?

#### Implementation

With static knowledge, many regexes can be compiled into [efficient, minimal state machines](https://swtch.com/~rsc/regexp/regexp1.html) (albeit with Unicode rules and tables as input at run time). [Virtual machine approaches](https://swtch.com/~rsc/regexp/regexp2.html) can make it easier to specify and guarantee behavior while also being much more extensible. They also may alleviate some ABI-stability concerns, as it’s easier to keep a bytecode stable while growing new features.

Delivering ergonomic regexes in Swift 5 should not be blocked on having an efficient implementation available immediately. However, long term, it is important to have a [predictable and consistent performance model](https://github.com/google/re2/wiki) to prevent [unanticipated exponential behavior](https://en.wikipedia.org/wiki/ReDoS). Thus, implementation concerns are relevant in scoping the initial feature set (e.g. regarding backreferences).

Some regexes are amenable to task-parallel strategies (i.e. threads), data-parallel strategies (i.e. vector or GPU), or both. Swift will probably grow new strategies and combinations of strategies over time. Implementations can also take advantage of the aforementioned performance flags (e.g. isCanonical) to greatly speed up matches.

The interface to a regex engine will be part of Swift’s ABI, including whatever intermediate representation regexes are compiled into. This could be a simple regex AST, a virtual machine bytecode, etc., but it needs to be extensible to accommodate any potential future additions and implementation strategies.

### Recap: Potential Additions for Swift 5

* Some form of unmanaged or unsafe Strings, and corresponding APIs
* Exposing performance flags, and some way to request a scan to populate them
* API gaps
  * Character and UnicodeScalar properties, such as isNewline
* Generalizing, and optimizing, String interpolation
* Regex literals, Regex type, and generalized pattern match destructuring
* Substitution APIs, in conjunction with Regexes.

## Community

With Swift’s mailing lists [moving to Discourse](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171211/042127.html), we have an opportunity for better community engagement and cross-referencing of posts. We plan on starting an area under the “Using Swift” category focusing on idiomatic and/or performant solutions to string programming needs. These might be anything from simple programming problems to make-my-parser-faster challenges.

While this provides a community benefit by showing how best to use String, our ulterior motive is to mine posts looking for benchmarks, API gaps, and representative code that new features could improve. We’ll also, of course, want to bikeshed the name! My vote is for “String Dojo”, but there’s probably a reason I don’t normally name things.


(A gist-formatted version of this email can be found at https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f)

I’m very very excited for this, thank you for the detailed writeup and consideration of the effects and tradeoffs involved.

Given that ordering is not fit for human consumption, but rather machine processing, it might as well be fast. The current ordering differs on Darwin and Linux platforms, with Linux in particular suffering from poor performance due to choice of ordering (UCA with DUCET) and older versions of ICU. Instead, [String Comparison Prototype](https://github.com/apple/swift/pull/12115) provides a simpler ordering that allows for many common-case fast-paths and optimizations. For all the Unicode enthusiasts out there, this is the lexicographical ordering of NFC-normalized UTF-16 code units.

Thank you for fixing this. Your tradeoffs make perfect sense to me.

### Small String Optimization

..

For example, assuming a 16-byte String struct and 8 bits used for flags and discriminators (including discriminators for which small form a String is in), 120 bits are available for a small string payload. 120 bits can hold 7 UTF-16 code units, which is sufficient for most graphemes and many common words and separators. 120 bits can also fit 15 ASCII/UTF-8 code units without any packing, which suffices for many programmer/system strings (which have a strong skew towards ASCII).

We may also want a compact 5-bit encoding for formatted numbers, such as 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in base-10, which would require 18, 19, and 24 characters respectively. 120 bits with a 5-bit encoding would fit all of these. This would speed up the creation and interpolation of many strings containing numbers.

I think it is important to consider that having more special cases and different representations slows down nearly *every* operation on string because they have to check and detangle all of the possible representations. Given the ability to hold 15 digits of ascii, I don’t see why it would be worthwhile to worry about a 5-bit representation for digits. String should be an Any!

The tradeoff here is that you’d be biasing the design to favor creation and interpolation of many strings containing *large* numbers, at the cost of general string performance anywhere. This doesn’t sound like a good tradeoff for me, particularly when people writing extremely performance sensitive code probably won’t find it good enough anyway.

Final details are still in exploration. If the construction and interpretation of small bit patterns can remain behind a resilience barrier, new forms could be added in the future. However, the performance impact of this would need to be carefully evaluated.

I’d love to see performance numbers on this. Have you considered a design where you have exactly one small string representation: a sequence of 15 UTF8 bytes? This holds your 15 bytes of ascii, probably more non-ascii characters on average than 7 UTF16 codepoints, and only needs one determinator branch on the entry point of hot functions that want to touch the bytes. If you have a lot of algorithms that are sped up by knowing they have ascii, you could go with two bits: “is small” and “isascii”, where isascii is set in both the small string and normal string cases.

Finally, what tradeoffs do you see between a 1-word vs 2-word string? Are we really destined to have 2-words? That’s still much better than the 3 words we have now, but for some workloads it is a significant bloat.

### Unmanaged Strings

Such unmanaged strings can be used when the lifetime of the storage is known to outlive all uses.

Just like StringRef! +1, this concept can be a huge performance win… but can also be a huge source of UB if used wrong.. :-(

As such, they don’t need to participate in reference counting. In the future, perhaps with ownership annotations or sophisticated static analysis, Swift could convert managed strings into unmanaged ones as an optimization when it knows the contents would not escape. Similarly in the future, a String constructed from a Substring that is known to not outlive the Substring’s owner can use an unmanaged form to skip copying the contents. This would benefit String’s status as the recommended common-currency type for API design.

This could also have implications for StaticString.

Some kind of unmanaged string construct is an often-requested feature for highly performance-sensitive code. We haven’t thought too deeply about how best to surface this as API and it can be sliced and diced many ways. Some considerations:

Other questions/considerations:
- here and now, could we get the vast majority of this benefit by having the ABI pass string arguments as +0 and guarantee their lifetime across calls? What is the tradeoff of that?
- does this concept even matter if/when we can write an argument as “borrowed String”? I suppose this is a bit more general in that it should be usable as properties and other things that the (initial) ownership design isn’t scoped to support since it doesn’t have lifetimes.
- array has exactly the same sort of concern, why is this a string-specific problem?
- how does this work with mutation? Wouldn’t we really have to have something like MutableArrayRef since its talking about the mutability of the elements, not something that var/let can support?

When I think about this, it seems like an “non-owning *slice*” of some sort. If you went that direction, then you wouldn’t have to build it into the String currency type, just like it isn’t in LLVM.

### Performance Flags

Nice.

### Miscellaneous

There are many other tweaks and improvements possible under the new ABI prior to stabilization, such as:

* Guaranteed nul-termination for String’s storage classes for faster C bridging.

This is probably best as another perf flag.

* Unification of String and Substring’s various Views.
* Some form of opaque string, such as an existential, which can be used as an extension point.
* More compact/performant indices.

What is the current theory on eager vs lazy bridging with Objective-C? Can we get to an eager bridging model? That alone would dramatically simplify and speed up every operation on a Swift string.

## String Ergonomics

I need to run now, but I’ll read and comment on this section when I have a chance.

That said, just today I had to write the code below and the ‘charToByte’ part of it is absolutely tragic. Are there any thoughts about how to improve character literal processing?

-Chris

func decodeHex(_ string: String) -> [UInt8] {
  func charToByte(_ c: String) -> UInt8 {
    return c.utf8.first! // swift makes char processing grotty. :-(
  }

  func hexToInt(_ c : UInt8) -> UInt8 {
    switch c {
    case charToByte("0")...charToByte("9"): return c - charToByte("0")
    case charToByte("a")...charToByte("f"): return c - charToByte("a") + 10
    case charToByte("A")...charToByte("F"): return c - charToByte("A") + 10
    default: fatalError("invalid hexadecimal character")
    }
  }

  var result: [UInt8] = []

  assert(string.count & 1 == 0, "must get a pair of hexadecimal characters")
  var it = string.utf8.makeIterator()
  while let byte1 = it.next(),
        let byte2 = it.next() {
    result.append((hexToInt(byte1) << 4) | hexToInt(byte2))
  }

  return result
}

print(decodeHex("01FF"))
print(decodeHex("8001"))
print(decodeHex("80a1bcd3"))


## String Ergonomics

Ergonomics is an area that’s not been well-served by String. Collection conformance and multi-line literals in Swift 4 were a welcome first step. But, trying to write a script, solve a classic whiteboard interview question, or participate in a coding competition can quickly become an exercise in frustration.

Indeed. It has come a long way, but isn’t there yet.

#### Character Properties

One immediately obvious gap is Character, which lacks basic queries. For example, it should be possible to write something like `myStr.filter { !$0.isWhitespace }` to get a string with all whitespace removed.

Yes, I think that this is critical for multiple reasons. First of which is direct utility, which you’re focusing on. The second of which is that I think that this directly dovetails into the regex syntax discussion: ideally we’d have named character classes and be able to explain them as being *exactly* defined as a corresponding character classification property.

It would be great to get a property for each of (e.g.) the Perl 6 character classes:

I consider this (along with all the annoying work to define what they should match) as a prerequisite for getting regex’s nailed down.

While this can quickly get bogged down in the details (Q: “Does the Mongolian vowel separator count as whitespace?” A: Depends on the version of Unicode!), there should be obvious, well-understood, generally useful queries that Character can answer. This area has been well-trodden by CharacterSet and its static properties can provide guidance on useful queries. Some potential queries would include isWhitespace, isNewline, isPunctuation, isAlphaNumeric, etc. These should probably also be present on UnicodeScalar.

Relatedly, the Foundation CharacterSet type is extremely confusing because it isn’t a set of Swift Characters. I don’t suppose there is any appetite to rectify or generalize it somehow, is there? I wouldn’t overly rely on it for guidance on these issues given that it is stuck so squarely in the realm of UTF16.

Note: In no way would these properties obviate the need for CharacterSet, as CharacterSet API is independently useful and the right model for many whole-string operations.

No it isn’t. A Grapheme-cluster based analog of CharacterSet would be a reasonable model to consider though. It could conceptually support predicates like isEmoji()

### Interpolation: Building Strings Up

String interpolation, especially when used in conjunction with other language features such as multi-line string literals, present a compelling, visual way to construct strings. For example, the following theoretical printout of a benchmark result:

print("""
 Result: \(benchmark.result)
 Execution time:
   user(ms): \(benchmark.userTime.milliseconds)
   system(ms): \(benchmark.systemTime.milliseconds)
   wall(ms): \(benchmark.wallTime.milliseconds)
 """)

String interpolation has many issues and limitations:

* Current interface is inefficient. Individual segments are turned into Strings before interpolation. While small string optimizations alleviate some of this, larger segments incur unneeded heap traffic and tracking.

Completely agreed: personally I’d consider this a critical P1 to fix for ABI stability, far more important than the ergonomic issues you’re describing here.

* While it is [fairly customizable](https://oleb.net/blog/2017/01/fun-with-string-interpolation/), the interface is clunky and deprecated. Brent Royal-Gordon has clearly articulated the problems and provided potential solutions in [Fix ExpressibleByStringInterpolation](https://github.com/brentdax/swift-evolution/blob/230e3ab5fe4a4acaabf52806f69ae3dd1ff9f538/proposals/NNNN-fix-expressible-by-string-interpolation.md). This approach does need to be adjusted to address the performance issue above.

Yes, yes yes.

* Interpolated expressions are supplied inside the literal, meaning they cannot be passed around like format strings without extra boilerplate (e.g. a wrapping function).

The original (2013 era) idea was to allow for interpolants to take multiple arguments (the contents of \(…) would be passed as the argument list), so you could specify formatting information as subsequent parameters, e.g.:

    "your file access is set to \(mode, .Octal)".

or:
    "your file access is set to \(mode, format: .Octal)".
  
or something like that. Of course each type would define its own formatting modes, and this would probably be reflected through a type-specific initializer.

### Regex: Tearing Strings Apart

This is clearly necessary but also clearly not part of Swift 5. Regexes should themselves be the subject of an intense design process. Swift 6 pretty please??? I’d suggest splitting this section out and starting a regex manifesto. :-)

This wouldn’t be worth doing if all it afforded was replacing `"foo"` with `/foo/` inside calls to NSRegularExpression. Regexes become compelling when viewed in conjunction with other language constructs such as switches (i.e. a new kind of [Pattern](https://docs.swift.org/swift-book/ReferenceManual/Patterns.html)), an ability to bind to variables, interpolate other matchers, a rich type system, etc.

+1

#### One Possible Approach

This is one, brief and hand-wavy, approach to adding regexes to Swift. This is not a proposal, and **all syntax is a total strawman** meant for illustration purposes only. The regex literal syntax below is an adaptation of [Chris Lattner’s strawman syntax](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171120/041619.html). In this syntax, whitespace is insignificant.

Nice. :-)

###### Compile Regex Literals into `Regex<T>`

Agreed.

Ok, I might sound like a madman here, but you realize that there is a really logical progression here:

1. You teach the compiler to parse and validate regex literals.
2. You teach the compiler to know what regexes are actually regular, so you can use linear time algorithms like DFA/NFAs instead of exponential algorithms where possible.
3. You teach the compiler to form the DFA.
4. You then teach the compiler that a switch on string where you have multiple regular regexes can be turned into a flex style “lexer” that matches all of the cases in parallel.
5. You rewrite the Swift compiler in Swift, because obviously the lexer was the hard part. :-)

QED

#### Feature Set

The examples above merely illustrate how regex literals could appear, but we need to carefully design the feature set in accordance with Swift’s goals and principles as a language.

One goal of Swift is to eventually [be better at string processing than Perl](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html), so let’s take a look at Perl. Perl’s history is full of interesting insights. Perl organically evolved into the de-facto standard for regex-based string processing culminating in Perl 5. More importantly, they reflected on the [“mess”](https://perl6.org/archive/doc/design/apo/A05.html) and cleaned it up in [Perl 6](https://docs.perl6.org/language/regexes). Perl 6 also has a [PEG](https://dl.acm.org/citation.cfm?id=964001.964011)-like [grammar system](https://docs.perl6.org/language/grammars) and we think similar approaches may be a part of Swift’s long-term future. For now, we’re focusing on regexes, although future considerations do impact the desired feature set.

Completely agreed. I’d strongly suggest thinking about this in terms of defining what the ultimate goal is, and then defining a progressive set of steps towards it. We should design a solution that can support grammars in the fullness of time, but shouldn’t block on that to get basic regexes.

Swift contrasts with Perl in that Swift places greater importance on clarity and simplicity over raw expressive power and extreme configurability. However, these are some features Swift should outright steal from Perl 6.

* **Free form syntax**. Whitespace is insignificant, comments can occur inside literals, etc., all of which aids readability
* **Named Captures**. If something’s worth capturing, it’s worth giving it a name.
* **Multi-line matching**. A new line is just another character, not a match terminator.
* **Regex interpolation**. Refactor complex regexes into simpler ones (with names!) and interpolate them. This is perhaps the most Swifty aspect of Perl 6 regexes!
* Miscellaneous features common to modern regex systems, such as:
  * Quantifier modifiers, e.g. frugal (“lazy”), ratcheting (“possessive”)
  * Non-capturing grouping, ratcheting (“atomic”) grouping
  * Separator-quantification (“[modified quantifier](https://docs.perl6.org/language/regexes)”)
  * String literals
  * Rich set of character classes (more on that next)

I want to live in the future, give it to me now :-) :-)

##### Character Classes

We will want a broad array of builtin pre-defined character classes for use in regexes. Additionally, users should be able to construct and define their own, whether from CharacterSets (i.e. enumeration) or through an arbitrary closure of type `(Character) -> Bool`. There should be syntax to use user-defined character classes in a regex literal.

Character classes should provide many set-algebraic operations such as union, intersection, inversion, mutual-exclusion (“xor”), and set difference. (Note that for our purposes, the standard library’s `SetAlgebra` protocol is not appropriate for character classes, as `SetAlgebra` requires conformances to be able to check for subset relationships which is intractable, if not undecidable, for closures over graphemes.)

Have you considered making the ‘named character class’ regex feature be sugar for calling Character methods inline in the rest of the regex? This seems best to me both because it offers user extensibility as well as an easy way to explain the behavior of the model.

##### Arguable Features

Some literal syntactic features common in other regex systems may be unnecessary or a bad fit for Swift.

1. Substitution literals (e.g. sed-like `s/regex/replacement/g`)
  * If String had better substitution APIs, such as one that took a `Regex<T>` and a closure from `(T) -> String` alongside options such as `.all` or `.firstOnly`, that might be sufficient and/or clearer anyways.
2. Adverbs (e.g. the `i` meaning case-insensitive outside the literal’s delimiters)
  * These could be Regex variants accessible via computed properties, which may be clearer
3. Anonymous (numbered) captures
  * In general, if something’s worth capturing it’s worth giving it a name.
  * A construct like `(let _ = \d+)` could produce an unlabeled tuple element.
  * Tuples (the representation for literal captures) support 0-based access already, e.g. `myCaptures.3`.
4. Enumerated character class ranges for non-ASCII graphemes (the `-` in `[A-Z]`)
  * Convenient for ASCII, but meaningless for graphemes.
  * Good built-in classes and convenient ways for users to combine and enumerate their own may even obviate the need for ranges for ASCII.

Agreed that all of those are marginal or discardable except for #1. IMO, we’ll need some solution for it, but it need not have first class language syntax like sed: a well designed API would be perfectly fine.

##### Forbidden Features (with Alternatives)

There are some regex features that we argue should be disallowed, or at least carefully constrained. Regexes are powerful, specifying simple textual analysis in a concise syntax, but are easily misused in ways that add exponentially-compounding complexity in non-obvious ways. This misuse contrasts with Swift’s values of balancing expressivity with clarity, where complex behavior should be *explicitly expressed in code*.

Right, this is one of the things that I think would be nice about tying character classes together with methods. The really nice thing about this is that tying it to predicates on Characters keeps usage of these as *actually regular* which permits efficient implementation, even if the generated code does a call to the user defined method.

However, we think there’s so much flexibility in what can accomplished without the most general form of these features that it might not be an issue in practice.

###### 1. Arbitrarily-Complex Look-Around Assertions

Assertions are the ability for a regex to “peek” ahead or behind at surrounding characters without consuming them as part of a match. They can be positive or negative. Arbitrarily-complex assertions include the ability to do captures as part of an assertion, unknown-length assertions satisfied by subpatterns, etc., which can have non-intuitively complex interactions with other parts of the regex.

**Alternative:** The following constrained assertions could be permissible:

1. Anchors (zero-width assertions), which test for positions within a string, e.g. `^` and `$`.

^ and $ anchors are clearly ok. They keep it regular and are widely used and understood. I don’t see a pitfall to supporting them.

2. Fixed-length character-class or closure-based assertions, which help avoid common pitfalls in negation (the classic “match a line that doesn’t contain a word” problem) as well as aid readability. This would permit fixed-length arbitrary computation *explicitly expressed in code*. These might also encompass boundaries.

3. Invoking a closure on the string prefix/suffix prior/after the assertion point. This would permit variable-length arbitrary computation, but *explicitly expressed in code*.

I don’t understand the tradeoffs involved on this - I’d suggest lazily evaluating this after the general model is more nailed down. We’ll know more and understand the implications of the decisions better.

###### 2. Arbitrary Uses of Backreferences

Backreferences allow testing a later portion of a string against a previously-matched portion. Unfortunately, in their most general form, they require exploration of the entire search space and [make matching NP-complete](https://perl.plover.com/NPC/NPC-3SAT.html), that is, take exponential time. Very simple uses of backreferences just want to check for an unambiguous capture after the fact, for which a `where` clause suffices. Very complex uses are better served with an exponential loop *explicitly expressed in code*.

**Alternative:** There is a range of convenient uses between these two extremes, which we may be able to tractably address. The feasibility of these is still under investigation, but some potential constrained backreferences may include:

1. A “ratcheting” backreference, where upon matching (i.e. the current section of the string satisfies the backreference), the referred-to capture is “frozen” in place, so that alternative assignments need not be explored.
2. A delimiter-checking construct. While eventually we want to encourage the user to write a real grammar/parser (assuming some future grammar/parser description system in Swift), simple delimiter matching can greatly aid readability. This might even be general and powerful enough to define tasks such as pairing XML tags and ignoring escaped delimiters or delimiters inside literals. Ideally, this would be done in a direction conducive to a future parsing solution.

Interesting, I’m not familiar with the tradeoffs implied by these. I have always assumed that we would support the generality of exponential regex matching but optimize the common regular cases. It would be awesome if there is a better model, but I hope we don’t give up too much familiarity and power in the process.

#### Semantics

Matching semantics should roughly correspond to [UTS #18 Level 2](https://unicode.org/reports/tr18/) of Unicode support. That is, “any character” would be any Swift Character (a grapheme), matching obeys canonical equivalence, etc. This is consistent with String’s overall philosophy of universality and harm reduction. This does require that the implementation use Unicode rules and tables provided at run time, e.g. for grapheme breaking.

+1. Of course we’d fastpath ASCII.

#### Implementation

With static knowledge, many regexes can be compiled into [efficient, minimal state machines](https://swtch.com/~rsc/regexp/regexp1.html\) (albeit with Unicode rules and tables as input at run time). [Virtual machine approaches](https://swtch.com/~rsc/regexp/regexp2.html\) can make it easier to specify and guarantee behavior while also being much more extensible. They also may alleviate some ABI-stability concerns, as it’s easier to keep a bytecode stable while growing new features.

Delivering ergonomic regexes in Swift 5 should not be blocked on having an efficient implementation available immediately. However, long term, it is important to have a [predictable and consistent performance model](https://github.com/google/re2/wiki) to prevent [unanticipated exponential behavior](https://en.wikipedia.org/wiki/ReDoS). Thus, implementation concerns are relevant in scoping the initial feature set (e.g. regarding backreferences).

Agreed. We should define the model with the aim of making it easily optimizable. Once the model is defined and implemented, getting compiler-generated DFA/NFA implementations would be a great intern project.

Some regexes are amenable to task-parallel strategies (i.e. threads), data-parallel strategies (i.e. vector or GPU), or both. Swift will probably grow new strategies and combinations of strategies over time. Implementations can also take advantage of the aforementioned performance flags (e.g. isCanonical) to greatly speed up matches.

The interface to a regex engine will be part of Swift’s ABI, including whatever intermediate representation regexes are compiled into. This could be a simple regex AST, a virtual machine bytecode, etc., but it needs to be extensible to accommodate any potential future additions and implementation strategies.

Obviously we should define a new Turing complete bytecode format for the regex VM, then build an LLVM backend that generates it, then compile the swift compiler into it, then embed that into the Swift compiler and use it to implement a first class macro system in Swift.

Oh wait, no we shouldn’t. :-)

-Chris

···

On Jan 10, 2018, at 11:55 AM, Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:


^ String should NOT be an Any!

-Chris

···

On Jan 10, 2018, at 9:29 PM, Chris Lattner <clattner@nondot.org> wrote:

On Jan 10, 2018, at 11:55 AM, Michael Ilseman via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:

(A gist-formatted version of this email can be found at https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f\)

I’m very very excited for this, thank you for the detailed writeup and consideration of the effects and tradeoffs involved.

Given that ordering is not fit for human consumption, but rather machine processing, it might as well be fast. The current ordering differs on Darwin and Linux platforms, with Linux in particular suffering from poor performance due to choice of ordering (UCA with DUCET) and older versions of ICU. Instead, [String Comparison Prototype](https://github.com/apple/swift/pull/12115\) provides a simpler ordering that allows for many common-case fast-paths and optimizations. For all the Unicode enthusiasts out there, this is the lexicographical ordering of NFC-normalized UTF-16 code units.

Thank you for fixing this. Your tradeoffs make perfect sense to me.

### Small String Optimization

..

For example, assuming a 16-byte String struct and 8 bits used for flags and discriminators (including discriminators for which small form a String is in), 120 bits are available for a small string payload. 120 bits can hold 7 UTF-16 code units, which is sufficient for most graphemes and many common words and separators. 120 bits can also fit 15 ASCII/UTF-8 code units without any packing, which suffices for many programmer/system strings (which have a strong skew towards ASCII).

We may also want a compact 5-bit encoding for formatted numbers, such as 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in base-10, which would require 18, 19, and 24 characters respectively. 120 bits with a 5-bit encoding would fit all of these. This would speed up the creation and interpolation of many strings containing numbers.

I think it is important to consider that having more special cases and different representations slows down nearly *every* operation on string because they have to check and detangle all of the possible representations. Given the ability to hold 15 digits of ascii, I don’t see why it would be worthwhile to worry about a 5-bit representation for digits. String should be an Any!

RT. in my own code i’ve mostly taken to using [UInt8] or [Unicode.Scalar] and matching them against integer literals because that’s still easier than dealing with Strings and Characters and grapheme breaking functionality i don’t need for the use case.

If you ask me tho String should really be split into two types, with different purposes. String right now is very high level, for complex, generalized, emojified user-facing and user-input text. Now we’re trying to make it good for low-level stuff without compromising on the high-level stuff and it’s just making a mess of an API and permutations on permutations of flags.

We need to have a StringArray<CodeUnit> type which is like [UInt8] or [Unicode.Scalar] except with an API for stringy operations. Two tools for two jobs. StringArrays could be written with single-quote literals in the syntax; single quotes are completely unused and no one wants to use them for regexes, so why not make them StringArrays. They’d use bitwise equalities so things like (e, ‘) won’t compare equal to (é) *by default*. Because how many HTML tags have names with a combining character in them? When was the last time you read a file format specification where the markers to match weren’t a table of 4-character ASCII strings?


In this particular example, you can use UInt8.init(ascii:) to make it a little nicer (no need for the charToByte function):

    func hexToInt(_ c : UInt8) -> UInt8 {
        switch c {
        case UInt8(ascii: "0")...UInt8(ascii: "9"):
            return c - UInt8(ascii: "0")
        case UInt8(ascii: "a")...UInt8(ascii: "f"):
            return c - UInt8(ascii: "a") + 10
        case UInt8(ascii: "A")...UInt8(ascii: "F"):
            return c - UInt8(ascii: "A") + 10
        default: fatalError("invalid hexadecimal character")
        }
    }
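
Putting it all together, a self-contained version of the whole function along those lines (same behavior as the original sketch, just without charToByte) might look like:

    func decodeHex(_ string: String) -> [UInt8] {
        func hexToInt(_ c: UInt8) -> UInt8 {
            switch c {
            case UInt8(ascii: "0")...UInt8(ascii: "9"): return c - UInt8(ascii: "0")
            case UInt8(ascii: "a")...UInt8(ascii: "f"): return c - UInt8(ascii: "a") + 10
            case UInt8(ascii: "A")...UInt8(ascii: "F"): return c - UInt8(ascii: "A") + 10
            default: fatalError("invalid hexadecimal character")
            }
        }

        assert(string.count % 2 == 0, "must get a pair of hexadecimal characters")
        var result: [UInt8] = []
        var it = string.utf8.makeIterator()
        while let hi = it.next(), let lo = it.next() {
            result.append((hexToInt(hi) << 4) | hexToInt(lo))
        }
        return result
    }

    print(decodeHex("01FF")) // [1, 255]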

···

On 11. Jan 2018, at 06:36, Chris Lattner via swift-dev <swift-dev@swift.org> wrote:


That said, just today I had to write the code below and the ‘charToByte’ part of it is absolutely tragic. Are there any thoughts about how to improve character literal processing?

-Chris

func decodeHex(_ string: String) -> [UInt8] {
  func charToByte(_ c: String) -> UInt8 {
    return c.utf8.first! // swift makes char processing grotty. :-(
  }

  func hexToInt(_ c : UInt8) -> UInt8 {
    switch c {
    case charToByte("0")...charToByte("9"): return c - charToByte("0")
    case charToByte("a")...charToByte("f"): return c - charToByte("a") + 10
    case charToByte("A")...charToByte("F"): return c - charToByte("A") + 10
    default: fatalError("invalid hexadecimal character")
    }
  }

  var result: [UInt8] = []

  assert(string.count & 1 == 0, "must get a pair of hexadecimal characters")
  var it = string.utf8.makeIterator()
  while let byte1 = it.next(),
        let byte2 = it.next() {
    result.append((hexToInt(byte1) << 4) | hexToInt(byte2))
  }

  return result
}

print(decodeHex("01FF"))
print(decodeHex("8001"))
print(decodeHex("80a1bcd3"))

Hi Chris!

+CC Michael Gottesman, as I veer into talking about ARC.

(A gist-formatted version of this email can be found at https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f\)

I’m very very excited for this, thank you for the detailed writeup and consideration of the effects and tradeoffs involved.

Given that ordering is not fit for human consumption, but rather machine processing, it might as well be fast. The current ordering differs on Darwin and Linux platforms, with Linux in particular suffering from poor performance due to choice of ordering (UCA with DUCET) and older versions of ICU. Instead, [String Comparison Prototype](https://github.com/apple/swift/pull/12115\) provides a simpler ordering that allows for many common-case fast-paths and optimizations. For all the Unicode enthusiasts out there, this is the lexicographical ordering of NFC-normalized UTF-16 code units.

Thank you for fixing this. Your tradeoffs make perfect sense to me.

### Small String Optimization

..

For example, assuming a 16-byte String struct and 8 bits used for flags and discriminators (including discriminators for which small form a String is in), 120 bits are available for a small string payload. 120 bits can hold 7 UTF-16 code units, which is sufficient for most graphemes and many common words and separators. 120 bits can also fit 15 ASCII/UTF-8 code units without any packing, which suffices for many programmer/system strings (which have a strong skew towards ASCII).

We may also want a compact 5-bit encoding for formatted numbers, such as 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in base-10, which would require 18, 19, and 24 characters respectively. 120 bits with a 5-bit encoding would fit all of these. This would speed up the creation and interpolation of many strings containing numbers.

I think it is important to consider that having more special cases and different representations slows down nearly *every* operation on string because they have to check and detangle all of the possible representations.

nit: Multiple small representations would slow down nearly every operation on *small* strings. That is, we would initially branch on an isSmall check, and then small strings would pay the cost of further inspection. This could also slow down String construction, depending on how aggressively we try to pack/transcode and whether we also have specialized inits.
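
To illustrate the shape of that cost (purely hypothetical names and layout, not the actual design):

enum SmallForm {
    case utf8          // up to 15 code units in the 120-bit payload
    case utf16         // up to 7 code units
    case packedDigits  // hypothetical 5-bit number encoding
}

// A hot operation first branches on isSmall; with multiple small forms it then
// pays a second dispatch like this before touching any bytes. With a single
// (say, UTF-8) small form, the isSmall check alone would suffice.
func smallCapacity(of form: SmallForm) -> Int {
    switch form {
    case .utf8:         return 120 / 8   // 15
    case .utf16:        return 120 / 16  // 7
    case .packedDigits: return 120 / 5   // 24
    }
}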

Given the ability to hold 15 digits of ascii, I don’t see why it would be worthwhile to worry about a 5-bit representation for digits. String should be an Any!

The tradeoff here is that you’d be biasing the design to favor creation and interpolation of many strings containing *large* numbers, at the cost of general string performance anywhere. This doesn’t sound like a good tradeoff for me, particularly when people writing extremely performance sensitive code probably won’t find it good enough anyway.

Final details are still in exploration. If the construction and interpretation of small bit patterns can remain behind a resilience barrier, new forms could be added in the future. However, the performance impact of this would need to be carefully evaluated.

I’d love to see performance numbers on this. Have you considered a design where you have exactly one small string representation: a sequence of 15 UTF8 bytes? This holds your 15 bytes of ascii, probably more non-ascii characters on average than 7 UTF16 codepoints, and only needs one discriminator branch on the entry point of hot functions that want to touch the bytes. If you have a lot of algorithms that are sped up by knowing they have ascii, you could go with two bits: “is small” and “isascii”, where isascii is set in both the small string and normal string cases.

We have not yet done the investigation, so all details could change. This is my (perhaps flawed!) reasoning as to why, in general, it may be useful to have multiple small forms. Again, this will change as we learn more.

Strings are created and copied (i.e. a value-copy/retain) more often than they are read, and strings are read far more often than they are modified. The “high-order bit” for performance is whether a string is small or not, since a small form avoids an allocation and, just as importantly, skips managing the memory with ARC. Beyond that, we’re debating smaller performance deltas, which are still important. This reasoning also influences where we choose to stick our resilience barriers, which can be relaxed in the future.

One of the nice things about having a small representation for each of our backing storage kinds (ASCII/UTF-8 vs UTF-16) is that it would give us a very fast way of conservatively checking whether we can form a small string. E.g., if we’re forming a Substring over a portion of a UTF-16-backed string, we can check whether the range holds 7 or fewer code units to avoid the ARC traffic. I don’t know if it would be worth the attempt to detect whether it would fit in 15 transcoded UTF-8 code units vs bumping the ref count. Avoiding the extra packing attempt may help branch mis-predicts, though I’m *way* out of my depth when it comes to reasoning about mis-predicts as they emerge in the wild. There is a “temporal locality of Unicode-ness” in string processing, though.

As far as what strings a 7-code-unit UTF-16 small form could hold that couldn’t fit in 15 code units of UTF-8, the biggest cases are scalars in the latter part of the BMP (U+0800 and above), which take three UTF-8 code units each, so a small UTF-8 string could only store 5 of them. This may not be a big deal.

As to a 5-bit encoding, again, all details pending more experimentation. Its importance may be diminished by more efficient, low-level String construction interfaces. The difference between 15 and 24 characters could be big, considering that formatted addresses and Doubles fit in that range. We’ll see.
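
For reference, the back-of-the-envelope arithmetic behind those 18/19/24 figures and the 15-vs-24 gap (not measurements, just counting characters, assuming a 64-bit platform):

let addressChars = 2 + 64 / 4             // 18: "0x" plus 16 hex digits
let intMaxChars  = String(Int.max).count  // 19: "9223372036854775807"
let doubleChars  = 1 + 17 + 1 + 1 + 1 + 3 // 24: sign, 17 digits, ".", "e", exponent sign, 3 exponent digits
let asciiSmallCapacity   = 120 / 8        // 15: fits none of the above
let fiveBitSmallCapacity = 120 / 5        // 24: fits all of the above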

Finally, what tradeoffs do you see between a 1-word vs 2-word string? Are we really destined to have 2-words? That’s still much better than the 3 words we have now, but for some workloads it is a significant bloat.

<repeat disclaimer about final details being down to real data>. Some arguments in favor of 2-word, presented roughly in order of impact:

1. This allows the String type to accommodate llvm::StringRef-style usages. This is pretty broad usage: “mmap a file and treat its contents as a String”, “store all my contents in an llvm::BumpPtr which outlives uses”, un-owned slices, etc. One word String would greatly limit this to only whole-string nul-terminated cases.

2. Two-word String fits more small strings. Exactly where along the diminishing-returns curve 7 vs 15 UTF-8 code units lie is dependent on the data set. One example is NSString, which (according to reasoning at https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html) considered it important enough to have 6- and 5-bit reduced ASCII character sets to squeeze strings of up to 11 characters into a word. 15-code-unit small strings would be a super-set of tagged NSStrings, meaning we could bridge them eagerly in-line, while 7-code-unit small strings would be a subset (and also a strong argument against eagerly bridging them).

If you have access to any interesting data sets and can report back some statistics, that would be immensely helpful!

3. More bits available to reserve for future-proofing, etc., though many of these could be stored in the header.

4. The second word can cache useful information from large strings. `endIndex` is a very frequently requested computed property and it could be stored directly in-line rather than loaded from memory (though perhaps a load happens anyways in a subsequent read of the string). Alternatively, we could store the grapheme count or some other piece of information that we’d otherwise have to recompute. More experimentation needed here.

5. (vague and hand-wavy) Two words fit into a nice groove that 3 words don’t: 2 words is a rule-of-thumb size for very small buffers. It’s a common heap alignment, stack alignment, vector width, double-word-load width, etc. One-word Strings may be under-utilizing available resources; that is, the second word will often be there for use anyway. The main case where this is not true, and where 1-word shines, is aggregates of String.

I’m seeking some kind of ruthlessly-pragmatic balance here, and I think 2 word is currently winning. Again, up to further investigation and debate. FWIW, I believe Rust’s String is 3-words, i.e. it’s the std::vector-style pointer+length+capacity, but I haven’t looked into their String vs OsString vs OsStr vs … model.

### Unmanaged Strings

Such unmanaged strings can be used when the lifetime of the storage is known to outlive all uses.

Just like StringRef! +1, this concept can be a huge performance win… but can also be a huge source of UB if used wrong.. :-(

When it comes to naming, I’m more prone to argue for “UnsafeString” rather than “UnmanagedString”, but that’s a later debate for SE ;-)

As such, they don’t need to participate in reference counting. In the future, perhaps with ownership annotations or sophisticated static analysis, Swift could convert managed strings into unmanaged ones as an optimization when it knows the contents would not escape. Similarly in the future, a String constructed from a Substring that is known to not outlive the Substring’s owner can use an unmanaged form to skip copying the contents. This would benefit String’s status as the recommended common-currency type for API design.

This could also have implications for StaticString.

Yup!

Some kind of unmanaged string construct is an often-requested feature for highly performance-sensitive code. We haven’t thought too deeply about how best to surface this as API and it can be sliced and diced many ways. Some considerations:

Other questions/considerations:
- here and now, could we get the vast majority of this benefit by having the ABI pass string arguments as +0 and guarantee their lifetime across calls? What is the tradeoff of that?

Michael Gottesman (+CC) is doing this sort of investigation right now, evaluating the tradeoffs of more wide-spread use of the +0 convention. My current understanding is that even with +0, many retain/release pairs cannot be eliminated without more knowledge of the callee, but I don’t know the reasons why. Hopefully he can elaborate.

- does this concept even matter if/when we can write an argument as “borrowed String”? I suppose this is a bit more general in that it should be usable as properties and other things that the (initial) ownership design isn’t scoped to support since it doesn’t have lifetimes.

Right, it’s a little more general.

- array has exactly the same sort of concern, why is this a string-specific problem?

I think it’s interesting as slices may emerge more often in practice on strings than arrays, e.g. for presenting parsing results. Also, applications that care about this level of performance may custom-allocate all their string data contiguously (e.g. llvm::BumpPtr). I believe that projects such as llbuild have done this, forcing themselves into an API schism between String and UnsafeBufferPointer<UInt8>, often duplicating functionality. In theory all of this could also apply to array, but it seems to happen more for strings.

- how does this work with mutation? Wouldn’t we really have to have something like MutableArrayRef since its talking about the mutability of the elements, not something that var/let can support?

Good point. If this is more of a “UnsafeString”-like concept, there’s analogy with Unsafe[Mutable]BufferPointer.

When I think about this, it seems like a “non-owning *slice*” of some sort. If you went that direction, then you wouldn’t have to build it into the String currency type, just like it isn’t in LLVM.

Yes, though you would lose the ability to give them to APIs expecting a String when you know it’s safe to do so. E.g., when the contents are effectively immortal relative to the API call and effects. Careful use of “unsafe” or “unmanaged” in type names and argument labels needed here.
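
For concreteness, a minimal sketch of the shape such a type might take (names and API entirely hypothetical, and spelled “unsafe” per the naming point above):

/// Hypothetical sketch, not a real standard library type: a non-owning view
/// of UTF-8 contents whose storage must outlive every use of the view.
struct UnsafeStringView {
    let utf8: UnsafeBufferPointer<UInt8>
    var count: Int { return utf8.count }

    /// Copies the bytes into an ordinary managed String.
    func copied() -> String {
        return String(decoding: utf8, as: UTF8.self)
    }
}

// Example: a view over storage the caller keeps alive for the duration of use.
let bytes: [UInt8] = Array("hello, world".utf8)
bytes.withUnsafeBufferPointer { buffer in
    let view = UnsafeStringView(utf8: buffer)
    print(view.count, view.copied())
}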

### Performance Flags

Nice.

### Miscellaneous

There are many other tweaks and improvements possible under the new ABI prior to stabilization, such as:

* Guaranteed nul-termination for String’s storage classes for faster C bridging.

This is probably best as another perf flag.

I was thinking that if we always zero-fill our storage classes and reserve one extra character, then the flag would be redundant with the discriminator bits. However, that’s assuming the string doesn’t store a nul character in the middle. I agree this might need a separate perf flag.

* Unification of String and Substring’s various Views.
* Some form of opaque string, such as an existential, which can be used as an extension point.
* More compact/performant indices.

What is the current theory on eager vs lazy bridging with Objective-C? Can we get to an eager bridging model? That alone would dramatically simplify and speed up every operation on a Swift string.

I don’t think that will be feasible in general, although perhaps it could happen for most common occurrences.

## String Ergonomics

I need to run now, but I’ll read and comment on this section when I have a chance.

That said, just today I had to write the code below and the ‘charToByte’ part of it is absolutely tragic. Are there any thoughts about how to improve character literal processing?

-Chris

func decodeHex(_ string: String) -> [UInt8] {
  func charToByte(_ c: String) -> UInt8 {
    return c.utf8.first! // swift makes char processing grotty. :-(
  }

  func hexToInt(_ c : UInt8) -> UInt8 {
    switch c {
    case charToByte("0")...charToByte("9"): return c - charToByte("0")
    case charToByte("a")...charToByte("f"): return c - charToByte("a") + 10
    case charToByte("A")...charToByte("F"): return c - charToByte("A") + 10
    default: fatalError("invalid hexadecimal character")
    }
  }

Yeah, trying to do ASCII value ranges is currently pretty rough. Here’s what I did recently when I wanted something similar, using unicode scalar literal ranges:

extension Character {
    public var isASCII: Bool {
        guard let scalar = unicodeScalars.first, unicodeScalars.count == 1 else { return false }
        return scalar.value <= 0x7f
    }
    public var isASCIIAlphaNumeric: Bool {
        guard isASCII else { return false }
        switch unicodeScalars.first! {
        case "0"..."9": return true
        case "A"..."Z": return true
        case "a"..."z": return true
        default: return false
        }
    }
}

For a more general solution, I think a `var numericValue: Int? { get }` on Character would make sense. Unicode defines (at least one) semantics for this and ICU provides this functionality already.
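
Something like this in shape, though the real thing would consult Unicode’s Numeric_Value data rather than the ASCII-only toy logic below:

extension Character {
    /// Toy sketch only: handles ASCII digits; a real implementation would use
    /// Unicode's Numeric_Value property (via ICU).
    var numericValue: Int? {
        guard unicodeScalars.count == 1, let scalar = unicodeScalars.first,
              scalar.value >= 0x30, scalar.value <= 0x39 else { return nil } // "0"..."9"
        return Int(scalar.value - 0x30)
    }
}

// ("7" as Character).numericValue == 7
// ("x" as Character).numericValue == nil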

···


(I’m still working on a response on the ABI-side of this conversation, which deserves more time and thought)

## String Ergonomics

Ergonomics is an area that’s not been well-served by String. Collection conformance and multi-line literals in Swift 4 were a welcome first step. But, trying to write a script, solve a classic whiteboard interview question, or participate in a coding competition can quickly become an exercise in frustration.

Indeed. It has come a long way, but isn’t there yet.

#### Character Properties

One immediately obvious gap is Character, which lacks basic queries. For example, it should be possible to write something like `myStr.filter { !$0.isWhitespace }` to get a string with all whitespace removed.

Yes, I think that this is critical for multiple reasons. First of which is direct utility, which you’re focusing on. The second of which is that I think that this directly dovetails into the regex syntax discussion: ideally we’d have named character classes and be able to explain them as being *exactly* defined as a corresponding character classification property.

It would be great to get a property for each of (e.g.) the Perl 6 character classes:
https://github.com/Raku/old-design-docs (the Perl 6 language design documents)

I consider this (along with all the annoying work to define what they should match) as a prerequisite for getting regex’s nailed down.

I agree 100%

While this can quickly get bogged down in the details (Q: “Does the Mongolian vowel separator count as whitespace?” A: Depends on the version of Unicode!), there should be obvious, well-understood, generally useful queries that Character can answer. This area has been well-trodden by CharacterSet and its static properties can provide guidance on useful queries. Some potential queries would include isWhitespace, isNewline, isPunctuation, isAlphaNumeric, etc. These should probably also be present on UnicodeScalar.
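
For a feel of the shape (names here are placeholders and the logic is deliberately ASCII-only; the real properties would follow Unicode’s definitions):

extension Character {
    /// Placeholder: the real property would use Unicode's White_Space data.
    var isSimpleWhitespace: Bool {
        return self == " " || self == "\t" || self == "\n" || self == "\r"
    }
    /// Placeholder: ASCII decimal digits only.
    var isASCIIDigit: Bool {
        return ("0"..."9").contains(self)
    }
}

let compact = "1 2\t3\n".filter { !$0.isSimpleWhitespace } // "123"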

Relatedly, the Foundation CharacterSet type is extremely confusing because it isn’t a set of swift Characters. I don’t suppose there is any appetite to rectify or generalize it somehow is there?

We could rectify it through renaming, which could be proposed in a source-compatible way. I don’t know the naming history here, but I assumed it had some kind of term-of-art status. As far as generalizing, that may be more complicated than it seems (more details below in response to “A Grapheme-cluster based analog…”).

I wouldn’t overly rely on it for guidance on these issues given that it is stuck so squarely in the realm of UTF-16.

Wading a little into the weeds here, CharacterSet’s builtins model older Unicode semantics than what we probably want to provide. E.g. CharacterSet.lowercaseLetters means general category “Ll”, while modern Unicode defines property Lowercase which includes more scalars. The Lowercase property vs CharacterSet.lowercaseLetters is equivalent to ICU’s u_isULowercase [1] vs u_islower [2], where the former is the modern recommended API. Contrarily, Perl 6’s <lower> built-in rule[3] seems to also just be “Ll” and I’m not familiar with the rationale there (compatibility?). Certainly needs investigation.

I supposed by “guidance” I meant “here are some of the things people commonly ask about characters”. Built-in character classes in various regex flavors also provide guidance.

[1] u_isULowercase: ICU API documentation (unicode/uchar.h)
[2] u_islower: ICU API documentation (unicode/uchar.h)
[3] Perl 6 built-in character classes: https://github.com/Raku/old-design-docs

Note: In no way would these properties obviate the need for CharacterSet, as CharacterSet API is independently useful and the right model for many whole-string operations.

No it isn’t. A Grapheme-cluster based analog of CharacterSet would be a reasonable model to consider though. It could conceptually support predicates like isEmoji()

In the regex section I talk about character classes as being modeled by `(Character) -> Bool` rather than something that conforms to SetAlgebra. One big reason (beyond being awesomely convenient) is because I don’t think graphemes are something to be enumerated.

Theoretically speaking, I’m not sure if the set of graphemes is even recursively enumerable (e.g. zalgo-text or zalgo-emoji). Practically speaking, it darn-well might as well be uncountable. Best case, if even possible, any enumeration would be emergent behavior of encoding details and change significantly with each Unicode version. If we cannot enumerate graphemes, we cannot answer equality, isDisjoint, isSubset, etc.

I don’t know if those operations are important for a grapheme analogue. They just demonstrate that a grapheme set follows a different model than a scalar set, and different than what we have traditionally used the word “Set” for in Swift, though you would know the history here better.

As far as uses of such a grapheme set are concerned, they seem equivalent to application of a regex consisting of a character class. For example, rangeOfCharacter(from:options:) could be accomplished with the below. In an ever-futile attempt to avoid focusing on specific syntax, I’ll form an army of straw-people, using 「」 delimiters for a regex literal, « » to denote character classes, subscript on String to model application of a Regex, parenthesis to denote a capture that must always be a decl (even if the name is dropped by using `_`), and Regex literals defaulting to whole-string matching rather than the first partial-string match:

extension Character {
    var isEmoji: Bool { get }
}
extension Regex {
    var rangeOfMatch: Regex<(Range<String.Index>)> // Drops the captured value, just wants the range
    var allMatches: Regex<LazySequence<T>>
}

let emojiPattern =「(let _ = «.isEmoji»)」 // Regex<(Character)>

let theFirst = myString[emojiPattern.rangeOfMatch.firstPartial] // Range<String.Index>
let allOfThem = myString[emojiPattern.rangeOfMatch.allMatches] // LazySequence<Range<String.Index>>

(Alternatively, Regex could have an init taking a (Character) -> Bool)

In such a world, what do you image the role of a grapheme set type is? It might have more uses as provided by the platform, alongside things such as localization, linguistic analysis, etc.

### Interpolation: Building Strings Up

String interpolation, especially when used in conjunction with other language features such as multi-line string literals, presents a compelling, visual way to construct strings. For example, the following theoretical printout of a benchmark result:

print("""
 Result: \(benchmark.result)
 Execution time:
   user(ms): \(benchmark.userTime.milliseconds)
   system(ms): \(benchmark.systemTime.milliseconds)
   wall(ms): \(benchmark.wallTime.milliseconds)
 """)

String interpolation has many issues and limitations:

* Current interface is inefficient. Individual segments are turned into Strings before interpolation. While small string optimizations alleviate some of this, larger segments incur unneeded heap traffic and tracking.

Completely agreed: personally I’d consider this a critical P1 to fix for ABI stability, far more important than the ergonomic issues you’re describing here.

* While it is [fairly customizable](https://oleb.net/blog/2017/01/fun-with-string-interpolation/\), the interface is clunky and deprecated. Brent Royal-Gordon has clearly articulated the problems and provided potential solutions in [Fix ExpressibleByStringInterpolation](https://github.com/brentdax/swift-evolution/blob/230e3ab5fe4a4acaabf52806f69ae3dd1ff9f538/proposals/NNNN-fix-expressible-by-string-interpolation.md\). This approach does need to be adjusted to address the performance issue above.

Yes, yes yes.

* Interpolated expressions are supplied inside the literal, meaning they cannot be passed around like format strings without extra boilerplate (e.g. a wrapping function).

The original (2013 era) idea was to allow interpolants to take multiple arguments: the contents of \(…) would be passed as the argument list, so you could specify formatting information as subsequent parameters, e.g.:

    “your file access is set to \(mode, .Octal)”.

or:
    “your file access is set to \(mode, format: .Octal)”.
  
or something like that. Of course each type would define its own formatting modes, and this would probably be reflected through a type-specific initializer.
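
(For contrast, today the formatting has to be spelled as an ordinary call inside the segment, e.g. with the existing radix-taking String initializer; the multi-argument form above remains hypothetical syntax:)

let mode = 0o644
print("your file access is set to \(String(mode, radix: 8))")
// versus the hypothetical: "your file access is set to \(mode, format: .octal)"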

That looks very similar to one of Brent’s proposals (that I can no longer find a link to). Was there a reason it didn’t happen other than not getting around to it?

### Regex: Tearing Strings Apart

This is clearly necessary but also clearly not part of Swift 5. Regex’s should themselves be the subject of an intense design process. Swift 6 pretty please??? I’d suggest splitting this section out and starting a regex manifesto. :-)

Since it is so big, I feel like if we’re going to make any progress we need to start soon. Of course, any focus will be preempted as needed by ABI stability. Even if Swift N doesn’t get the fully formed feature, we’ll pick up things along the way. E.g. as you mention there’s a parity between Character’s Bool properties and character classes.

I definitely want to circulate and discuss these ideas a little bit before writing a “manifesto”.

This wouldn’t be worth doing if all it afforded was replacing `“foo”` with `/foo/` inside calls to NSRegularExpression. Regexes become compelling when viewed in conjunction with other language constructs such as switches (i.e. a new kind of Pattern, in the sense of the patterns chapter of The Swift Programming Language), an ability to bind to variables, interpolate other matchers, a rich type system, etc.

+1

#### One Possible Approach

This is one, brief and hand-wavy, approach to adding regexes to Swift. This is not a proposal, and **all syntax is a total strawman** meant for illustration purposes only. The regex literal syntax below is an adaptation of [Chris Lattner’s strawman syntax](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171120/041619.html\). In this syntax, whitespace is insignificant.

Nice. :-)

###### Compile Regex Literals into `Regex<T>`

Agreed.

Ok, I might sound like a madman here, but you realize that there is a really logical progression here:

1. You teach the compiler to parse and validate regex literals.
2. You teach the compiler to know what regex’s are actually regular, so you can use linear time algorithms like DFA/NFAs instead of exponential algorithms where possible.
3. You teach the compiler to form the DFA.
4. You then teach the compiler that a switch on string where you have multiple regular regex’s can be turned into a flex style “lexer” that matches all of the cases in parallel.
5. You rewrite the Swift compiler in Swift, because obviously the lexer was the hard part. :-)

Clearly the hardest part!

QED

#### Feature Set

The examples above merely illustrate how regex literals could appear, but we need to carefully design the feature set in accordance with Swift’s goals and principles as a language.

One goal of Swift is to eventually [be better at string processing than Perl](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html), so let’s take a look at Perl. Perl’s history is full of interesting insights. Perl organically evolved into the de-facto standard for regex-based string processing, culminating in Perl 5. More importantly, they reflected on the [“mess”](https://github.com/Raku/old-design-docs) and cleaned it up in [Perl 6](https://github.com/Raku/old-design-docs). Perl 6 also has a [PEG](https://dl.acm.org/citation.cfm?id=964001.964011)-like [grammar system](https://github.com/Raku/old-design-docs) and we think similar approaches may be a part of Swift’s long-term future. For now, we’re focusing on regexes, although future considerations do impact the desired feature set.

Completely agreed. I’d strongly suggest thinking about this in terms of defining what the ultimate goal is, and then defining a progressive set of steps towards it. We should design a solution that can support grammars in the fullness of time, but shouldn’t block on that to get basic regexes.

I definitely want to see where some discussion here leads and whether the “StringDojo” effort provides good example fodder.

Swift contrasts with Perl in that Swift places greater importance on clarity and simplicity over raw expressive power and extreme configurability. However, these are some features Swift should outright steal from Perl 6.

* **Free form syntax**. Whitespace is insignificant, comments can occur inside literals, etc., all of which aids readability
* **Named Captures**. If something’s worth capturing, it’s worth giving it a name.
* **Multi-line matching**. A new line is just another character, not a match terminator.
* **Regex interpolation**. Refactor complex regexes into simpler ones (with names!) and interpolate them. This is perhaps the most Swifty aspect of Perl 6 regexes!
* Miscellaneous features common to modern regex systems, such as:
  * Quantifier modifiers, e.g. frugal (“lazy”), ratcheting (“possessive”)
  * Non-capturing grouping, ratcheting (“atomic”) grouping
  * Separator-quantification (“[modified quantifier](https://github.com/Raku/old-design-docs)”)
  * String literals
  * Rich set of character classes (more on that next)

I want to live in the future, give it to me now :-) :-)

##### Character Classes

We will want a broad array of builtin pre-defined character classes for use in regexes. Additionally, users should be able to construct and define their own, whether from CharacterSets (i.e. enumeration) or through an arbitrary closure of type `(Character) -> Bool`. There should be syntax to use user-defined character classes in a regex literal.

Character classes should provide many set-algebraic operations such as union, intersection, inversion, mutual-exclusion (“xor”), and set difference. (Note that for our purposes, the standard library’s `SetAlgebra` protocol is not appropriate for character classes, as `SetAlgebra` requires conformances to be able to check for subset relationships which is intractable, if not undecidable, for closures over graphemes.)
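
As a rough sketch of how such a user-defined character class might look as a library type (all names hypothetical), with the set-algebraic operations defined directly over predicates:

/// Hypothetical sketch of a predicate-based character class.
struct CharacterClass {
    let contains: (Character) -> Bool

    init(_ contains: @escaping (Character) -> Bool) {
        self.contains = contains
    }

    func union(_ other: CharacterClass) -> CharacterClass {
        return CharacterClass { self.contains($0) || other.contains($0) }
    }
    func intersection(_ other: CharacterClass) -> CharacterClass {
        return CharacterClass { self.contains($0) && other.contains($0) }
    }
    func symmetricDifference(_ other: CharacterClass) -> CharacterClass {
        return CharacterClass { self.contains($0) != other.contains($0) }
    }
    func subtracting(_ other: CharacterClass) -> CharacterClass {
        return CharacterClass { self.contains($0) && !other.contains($0) }
    }
    var inverted: CharacterClass {
        return CharacterClass { !self.contains($0) }
    }
}

// Usage: combine an enumerated class with a closure-based one.
let asciiDigits  = CharacterClass { ("0"..."9").contains($0) }
let whitespace   = CharacterClass { $0 == " " || $0 == "\t" || $0 == "\n" }
let digitOrSpace = asciiDigits.union(whitespace)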

Have you considered making the ‘named character class’ regex feature be sugar for calling Character methods inline in the rest of the regex?

If by “Character methods” you mean “such as those new Unicode computed properties of type Bool we’re adding”, then that’s the plan!

This seems best to me both because it offers user extensibility as well as an easy way to explain the behavior of the model.

##### Arguable Features

Some literal syntactic features common in other regex systems may be unnecessary or a bad fit for Swift.

1. Substitution literals (e.g. sed-like `s/regex/replacement/g`)
  * If String had better substitution APIs, such as one that took a `Regex<T>` and a closure from `(T) -> String` alongside options such as `.all` or `.firstOnly`, that might be sufficient and/or clearer anyways.
2. Adverbs (e.g. the `i` meaning case-insensitive outside the literal’s delimiters)
  * These could be Regex variants accessible via computed properties, which may be clearer
3. Anonymous (numbered) captures
  * In general, if something’s worth capturing it’s worth giving it a name.
  * A construct like `(let _ = \d+)` could produce an unlabeled tuple element.
  * Tuples (the representation for literal captures) support 0-based access already, e.g. `myCaptures.3`.
4. Enumerated character class ranges for non-ASCII graphemes (the `-` in `[A-Z]`)
  * Convenient for ASCII, but meaningless for graphemes.
  * Good built-in classes and convenient ways for users to combine and enumerate their own may even obviate the need for ranges for ASCII.

Agreed that all of those are marginal or discardable except for #1. IMO, we’ll need some solution for it, but it need not have first class language syntax like sed: a well designed API would be perfectly fine.

Right, something like (strawman):

extension String {
  mutating func substitute<T>(_: Regex<T>, _: (T) -> String)
}

##### Forbidden Features (with Alternatives)

There are some regex features that we argue should be disallowed, or at least carefully constrained. Regexes are powerful, specifying simple textual analysis in a concise syntax, but are easily misused in ways that add exponentially-compounding complexity in non-obvious ways. This misuse contrasts with Swift’s values of balancing expressivity with clarity, where complex behavior should be *explicitly expressed in code*.

Right, this is one of the things that I think would be nice about tying character classes together with methods. The really nice thing about this is that tying it to predicates on Characters keeps usage of these as *actually regular* which permits efficient implementation, even if the generated code does a call to the user defined method.

+1, and communicating that such a method could be invoked many times in an unspecified order.

However, we think there’s so much flexibility in what can be accomplished without the most general form of these features that it might not be an issue in practice.

###### 1. Arbitrarily-Complex Look-Around Assertions

Assertions are the ability for a regex to “peek” ahead or behind at surrounding characters without consuming them as part of a match. They can be positive or negative. Arbitrarily-complex assertions include the ability to do captures as part of an assertion, unknown-length assertions satisfied by subpatterns, etc., which can have non-intuitively complex interactions with other parts of the regex.

**Alternative:** The following constrained assertions could be permissible:

1. Anchors (zero-width assertions), which test for positions within a string, e.g. `^` and `$`.

^ and $ anchors are clearly ok. They keep it regular and are widely used and understood. I don’t see a pitfall to supporting them.

2. Fixed-length character-class or closure-based assertions, which help avoid common pitfalls in negation (see the Stack Overflow question “Regular expression to match a line that doesn’t contain a word”) as well as aid readability. This would permit fixed-length arbitrary computation *explicitly expressed in code*. These might also encompass boundaries.

3. Invoking a closure on the string prefix/suffix prior/after the assertion point. This would permit variable-length arbitrary computation, but *explicitly expressed in code*.

I don’t understand the tradeoffs involved on this - I’d suggest lazily evaluating this after the general model is more nailed down. We’ll know more and understand the implications of the decisions better.

###### 2. Arbitrary Uses of Backreferences

Backreferences allow testing a later portion of a string against a previously-matched portion. Unfortunately, in their most general form, they require exploration of the entire search space and [make matching NP-complete](https://perl.plover.com/NPC/NPC-3SAT.html), that is, take exponential time in the worst case. Very simple uses of backreferences just want to check for an unambiguous capture after the fact, for which a `where` clause suffices. Very complex uses are better served with an exponential loop *explicitly expressed in code*.

**Alternative:** There is a range of convenient uses between these two extremes, which we may be able to tractably address. The feasibility of these is still under investigation, but some potential constrained backreferences may include:

1. A “ratcheting” backreference, where upon matching (i.e. the current section of the string satisfies the backreference), the referred-to capture is “frozen” in place, so that alternative assignments need not be explored.
2. A delimiter-checking construct. While eventually we want to encourage the user to write a real grammar/parser (assuming some future grammar/parser description system in Swift), simple delimiter matching can greatly aid readability. This might even be general and powerful enough to define tasks such as pairing XML tags and ignoring escaped delimiters or delimiters inside literals. Ideally, this would be done in a direction conducive to a future parsing solution.

Interesting, I’m not familiar with the tradeoffs implied by these. I have always assumed that we would support the generality of exponential regex matching but optimize the common regular cases. It would be awesome if there is a better model, but I hope we don’t give up too much familiarity and power in the process.

I think these will be clearer after more prototyping and investigation. My main inspiration for how regexes could fit in a language is Perl 6, and my inspiration for implementation is RE2. Perl 6 strongly encourages users off of regex and onto grammars (made up of “ratcheting” tokens), which is a more appropriate tool. RE2 forbids backreferences and most non-trivial assertions altogether. I think we can extend the RE2-style approach to service many of these uses without compromising its goals.

#### Semantics

Matching semantics should roughly correspond to [UTS #18 Level 2](https://unicode.org/reports/tr18/) of Unicode support. That is, “any character” would be any Swift Character (a grapheme), matching obeys canonical equivalence, etc. This is consistent with String’s overall philosophy of universality and harm reduction. This does require that the implementation use Unicode rules and tables provided at run time, e.g. for grapheme breaking.

+1. Of course we’d fastpath ASCII.
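To illustrate what grapheme-based, canonically-equivalent matching buys, Character already behaves this way today:

let precomposed: Character = "\u{00E9}"          // "é" as a single scalar
let decomposed: Character = "\u{0065}\u{0301}"   // "e" + combining acute accent
print(precomposed == decomposed)                 // true: canonical equivalence
print("café".contains(decomposed))               // true: "any character" at the grapheme level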

#### Implementation

With static knowledge, many regexes can be compiled into [efficient, minimal state machines](https://swtch.com/~rsc/regexp/regexp1.html) (albeit with Unicode rules and tables as input at run time). [Virtual machine approaches](https://swtch.com/~rsc/regexp/regexp2.html) can make it easier to specify and guarantee behavior while also being much more extensible. They also may alleviate some ABI-stability concerns, as it’s easier to keep a bytecode stable while growing new features.

Delivering ergonomic regexes in Swift 5 should not be blocked on having an efficient implementation available immediately. However, long term, it is important to have a [predictable and consistent performance model](https://github.com/google/re2/wiki) to prevent [unanticipated exponential behavior](https://en.wikipedia.org/wiki/ReDoS). Thus, implementation concerns are relevant in scoping the initial feature set (e.g. regarding backreferences).

Agreed. We should define the model with the aim of making it easily optimizable. Once the model is defined and implemented, getting compiler-generated DFA/NFA implementations would be a great intern project.
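To make the bytecode/VM idea concrete, here is a toy sketch of a Thompson-style simulation over a handful of invented instructions. It is purely illustrative (no captures, no anchors, no Unicode tables) and not a proposed format:

enum Inst {
    case match(Character)   // consume one Character equal to this one
    case split(Int, Int)    // nondeterministic branch to two targets
    case jump(Int)          // unconditional jump
    case accept             // whole input matched
}

// Thompson/Pike-style simulation: we track a *set* of program counters, so the
// cost is O(program.count * input.count) with no exponential backtracking.
func runs(_ program: [Inst], over input: String) -> Bool {
    func add(_ pc: Int, to set: inout Set<Int>, seen: inout Set<Int>) {
        guard seen.insert(pc).inserted else { return }
        switch program[pc] {
        case .split(let a, let b): add(a, to: &set, seen: &seen); add(b, to: &set, seen: &seen)
        case .jump(let target):    add(target, to: &set, seen: &seen)
        default:                   set.insert(pc)
        }
    }
    var current = Set<Int>(), seen = Set<Int>()
    add(0, to: &current, seen: &seen)
    for ch in input {
        var next = Set<Int>(), seen = Set<Int>()
        for pc in current {
            if case .match(let c) = program[pc], c == ch { add(pc + 1, to: &next, seen: &seen) }
        }
        current = next
    }
    return current.contains { if case .accept = program[$0] { return true }; return false }
}

// "ab*c", compiled by hand into the toy instruction set above:
let program: [Inst] = [.match("a"), .split(2, 4), .match("b"), .jump(1), .match("c"), .accept]
print(runs(program, over: "abbbc"))  // true
print(runs(program, over: "abx"))    // false

A real engine would add captures, character-class instructions, and so on, but the important property carries over: consuming more input never multiplies the number of live states.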

Some regexes are amenable to task-parallel strategies (i.e. threads), data-parallel strategies (i.e. vector or GPU), or both. Swift will probably grow new strategies and combinations of strategies over time. Implementations can also take advantage of the aforementioned performance flags (e.g. isCanonical) to greatly speed up matches.

The interface to a regex engine will be part of Swift’s ABI, including whatever intermediate representation regexes are compiled into. This could be a simple regex AST, a virtual machine bytecode, etc., but it needs to be extensible to accommodate any potential future additions and implementation strategies.

Obviously we should define a new Turing complete bytecode format for the regex VM, then build an LLVM backend that generates it, then compile the swift compiler into it, then embed that into the Swift compiler and use it to implement a first class macro system in Swift.

Finally, a reasonable approach to a hygienic macro system.

···

On Jan 11, 2018, at 10:28 PM, Chris Lattner <clattner@nondot.org> wrote:
On Jan 10, 2018, at 11:55 AM, Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:

Oh wait, no we shouldn’t. :-)

-Chris

Thanks Ole, that’s definitely nicer, I’ve updated my code!

Still not exactly ergonomic though :-)

-Chris

···

On Jan 11, 2018, at 3:28 AM, Ole Begemann via swift-dev <swift-dev@swift.org> wrote:

In this particular example, you can use UInt8.init(ascii:) to make it a little nicer (no need for the charToByte function):

    func hexToInt(_ c : UInt8) -> UInt8 {
        switch c {
        case UInt8(ascii: "0")...UInt8(ascii: "9"):
            return c - UInt8(ascii: "0")
        case UInt8(ascii: "a")...UInt8(ascii: "f"):
            return c - UInt8(ascii: "a") + 10
        case UInt8(ascii: "A")...UInt8(ascii: "F"):
            return c - UInt8(ascii: "A") + 10
        default: fatalError("invalid hexadecimal character")
        }
    }

Minor style point – this should be a failable init on Int rather than a computed property on Character

i.e. Int.init?(_: Character), matching Int.init?(_: String, radix: Int = 10), only it doesn’t need the radix arg cos it’s only a single character.

If implemented inside the std lib, it can still access character’s internals, which is a reasonable thing to do for something so performance-sensitive.
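A minimal outside-the-stdlib sketch of the suggested shape; the `hexDigit` label and the restriction to hex digits are just for this example, and `asciiValue` postdates this thread:

extension Int {
    init?(hexDigit c: Character) {
        guard let ascii = c.asciiValue else { return nil }
        switch ascii {
        case UInt8(ascii: "0")...UInt8(ascii: "9"): self = Int(ascii - UInt8(ascii: "0"))
        case UInt8(ascii: "a")...UInt8(ascii: "f"): self = Int(ascii - UInt8(ascii: "a")) + 10
        case UInt8(ascii: "A")...UInt8(ascii: "F"): self = Int(ascii - UInt8(ascii: "A")) + 10
        default: return nil
        }
    }
}

print(Int(hexDigit: "b") as Any)  // Optional(11)
print(Int(hexDigit: "x") as Any)  // nil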

···

On Jan 11, 2018, at 12:32 PM, Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:

For a more general solution, I think a `var numericValue: Int? { get }` on Character would make sense. Unicode defines (at least one) semantics for this and ICU provides this functionality already.

We may also want a compact 5-bit encoding for formatted numbers, such as 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in base-10, which would require 18, 19, and 24 characters respectively. 120 bits with a 5-bit encoding would fit all of these. This would speed up the creation and interpolation of many strings containing numbers.

I think it is important to consider that having more special cases and different representations slows down nearly *every* operation on string because they have to check and detangle all of the possible representations.

nit: Multiple small representations would slow down nearly every operation on *small* strings. That is, we would initially branch on an isSmall check, and then small strings would pay the cost of further inspection. This could also slow down String construction, depending on how aggressively we try to pack/transcode and whether we also have specialized inits.

Right. It is actually a bit more complicated than that though. Consider the Swift 4 generation of strings and arrays: because of lazy bridging each primitive operation includes the dynamic branching sequence. Beyond the dynamic performance impact of this, we end up having a really unfortunate set of tradeoffs to choose between:

1. We can leave the operations outlined in the standard library. Particularly for Array which benefits from being specialized on its element type, this is a huge perf loss.

2. We can inline the operation, but we end up inlining both sides of the operation, causing massive code bloat.

3. We can partially inline just the presumed fastpath, and leave the slow path out of line.

The problem with this when you bring it to string is that the “fastpath” would naturally be the small string case. If you make *it* have many levels of branching underneath it, not only do you slow down the fast case, but you also cause intense code bloat if you want to inline core operations into user code.

Given the ability to hold 15 digits of ascii, I don’t see why it would be worthwhile to worry about a 5-bit representation for digits. String should not be an Any!

I still don’t understand why this would even be considered, do you have any more rationale to explain that?

Final details are still in exploration. If the construction and interpretation of small bit patterns can remain behind a resilience barrier, new forms could be added in the future. However, the performance impact of this would need to be carefully evaluated.

I’d love to see performance numbers on this. Have you considered a design where you have exactly one small string representation: a sequence of 15 UTF8 bytes? This holds your 15 bytes of ascii, probably more non-ascii characters on average than 7 UTF16 codepoints, and only needs one determinator branch on the entry point of hot functions that want to touch the bytes. If you have a lot of algorithms that are sped up by knowing they have ascii, you could go with two bits: “is small” and “isascii”, where isascii is set in both the small string and normal string cases.
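A toy sketch of the single-small-form layout described above; the field widths and bit positions are invented here purely to make the flag checks concrete, and bear no relation to any eventual String layout:

// 16 bytes total: up to 15 UTF-8 code units plus one byte of flags.
struct ToySmallString {
    var low: UInt64    // first 8 payload bytes
    var high: UInt64   // next 7 payload bytes; the top byte holds the flags

    var flags: UInt8  { UInt8(truncatingIfNeeded: high >> 56) }
    var isSmall: Bool { flags & 0b1000_0000 != 0 }  // the single discriminator branch
    var isASCII: Bool { flags & 0b0100_0000 != 0 }  // fast-path hint
    var count: Int    { Int(flags & 0b0000_1111) }  // 0...15 code units
}

On a layout like this, hot paths pay exactly one isSmall test before touching the bytes, which is the property being argued for.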

We have not yet done the investigation, so all details could change. This is my (perhaps flawed!) reasoning as to why, in general, it may be useful to have multiple small forms. Again, this will change as we learn more.

Strings are created and copied (i.e. a value-copy/retain) more often than they are read (and strings are read far more often than they are modified).

If you believe it is highly biased towards this, then this strongly argues for a single word representation...

The “high-order bit” for performance is whether a string is small-or-not, as avoiding an allocation and, just as importantly, managing the memory with ARC is skipped. Beyond that, we’re debating smaller performance deltas, which are still important. This reasoning also influences where we choose to stick our resilience barriers, which can be relaxed in the future.

You’re making an argument about aggregate use cases on average, but it is unwise to ignore the common performance critical cases that do actually want to look at the bytes. Consider the http header parsing challenges the swift on server workgroup faced that led them to use a C implementation, for example.

One of the nice things about having a small representation for each of our backing storage kinds (ASCII/UTF-8 vs UTF-16) is that it would give us a very fast way of conservatively checking whether we can form a small string. E.g., if we’re forming a Substring over a portion of a UTF-16-backed string, we can check whether the range holds 7 or fewer code units to avoid the ARC. I don’t know if it would be worth the attempt to detect whether it would fit in 15 transcoded UTF-8 code units vs bumping the ref count. Avoiding the extra packing attempt may help with branch mis-predicts, though I’m *way* out of my depth when it comes to reasoning about mis-predicts as they emerge in the wild. There is a “temporal locality of Unicode-ness” in string processing, though.

As far as what strings would fit in a 7-code-unit UTF-16 small form but couldn’t fit in 15 code units of UTF-8, the biggest cases are latter-BMP scalars, where a small UTF-8 string could only store 5. This may not be a big deal.

Yes, I doubt that is a big deal. The increased density for international strings (at the cost of unpacking if you really do want UTF-16 for something) seems clearly beneficial: you eliminate the arc overhead and allocation, which quickly dwarf any unpacking overhead that might happen.

As to a 5-bit encoding, again, all details pending more experimentation. Its importance may be diminished by more efficient, low-level String construction interfaces. The difference between 15 and 24 characters could be big, considering that formatted addresses and Doubles fit in that range. We’ll see.

I still can’t imagine any workload that would care about this, and cannot see why punishing the normal cases would be worth any cost. Can you?

Finally, what tradeoffs do you see between a 1-word vs 2-word string? Are we really destined to have 2-words? That’s still much better than the 3 words we have now, but for some workloads it is a significant bloat.

<repeat disclaimer about final details being down to real data>. Some arguments in favor of 2-word, presented roughly in order of impact:

Understood. I don’t have a strong opinion on 1 vs 2 words, either are dramatically better than 3 :-). I’m glad you’re carefully evaluating the tradeoff.

1. This allows the String type to accommodate llvm::StringRef-style usages. This is pretty broad usage: “mmap a file and treat its contents as a String”, “store all my contents in an llvm::BumpPtr which outlives uses”, un-owned slices, etc. One word String would greatly limit this to only whole-string nul-terminated cases.

Yes, StringRef style algorithms are a big deal, as I mentioned in my previous email, but it is also unclear if this will really be a win when shoehorned into String. The benefit of StringRef is that it is a completely trivial type (both in the SIL sense but also in the implementation sense) and all the primitive ops get inlined. Given the “all things to all people” design of String, I’m very much afraid that trying to shoehorn this into the String currency type will fail to provide significant wins and thus lead to having a separate StringRef style type anyway. Providing a StringRef style projection type that is trivial (in the Swift sense) that knows in its static type that it never owns memory seems like the best path.

By point of comparison, C++ has std::string (yes, sure, with lots of issues) but they still introduced std::string_view (née StringRef) instead of wedging it in.

2. Two-word String fits more small strings. Exactly where along the diminishing-returns curve 7 vs 15 UTF-8 code units lie is dependent on the data set. One example is NSString, which (according to reasoning at https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html) considered it important enough to have 6- and 5-bit reduced ASCII character sets to squeeze strings of up to 11 characters into a word. 15 code unit small strings would be a super-set of tagged NSStrings, meaning we could bridge them eagerly in-line, while 7 code unit small strings would be a subset (and also a strong argument against eagerly bridging them).

Agreed, this is a big deal.

If you have access to any interesting data sets and can report back some statistics, that would be immensely helpful!

Sadly, I don’t. I’m only an opinionated hobbyist in this domain, one who has coded a lot of string processing over the years and understands at least some of the tradeoffs.

3. More bits available to reserve for future-proofing, etc., though many of these could be stored in the header.

4. The second word can cache useful information from large strings. `endIndex` is a very frequently requested computed property and it could be stored directly in-line rather than loaded from memory (though perhaps a load happens anyways in a subsequent read of the string). Alternatively, we could store the grapheme count or some other piece of information that we’d otherwise have to recompute. More experimentation needed here.

This seems weakly motivated: large strings can store end index in the heap allocation.

5. (vague and hand-wavy) Two-words fits into a nice groove that 3-words doesn’t: 2 words is a rule-of-thumb size for very small buffers. It’s a common heap alignment, stack alignment, vector-width, double-word-load width, etc.. 1-word Strings may be under-utilizing available resources, that is the second word will often be there for use anyways. The main case where this is not true and 1-word shines is aggregates of String.

What is the expected existential inline buffer size going to wind up being? We sized it to 3 words specifically to fit string and array. It would be great to shrink that to 2 or 1 words.

### Unmanaged Strings

Such unmanaged strings can be used when the lifetime of the storage is known to outlive all uses.

Just like StringRef! +1, this concept can be a huge performance win… but can also be a huge source of UB if used wrong.. :-(

When it comes to naming, I’m more prone to argue for “UnsafeString” rather than “UnmanagedString”, but that’s a later debate for SE ;-)

+1

Some kind of unmanaged string construct is an often-requested feature for highly performance-sensitive code. We haven’t thought too deeply about how best to surface this as API and it can be sliced and diced many ways. Some considerations:

Other questions/considerations:
- here and now, could we get the vast majority of this benefit by having the ABI pass string arguments as +0 and guarantee their lifetime across calls? What is the tradeoff of that?

Michael Gottesman (+CC) is doing this sort of investigation right now, evaluating the tradeoffs of more wide-spread use of the +0 convention. My current understanding is that even with +0, many retain/release pairs cannot be eliminated without more knowledge of the callee, but I don’t know the reasons why. Hopefully he can elaborate.

The tradeoff seems to be that you’re extending the lifetime across a call, preventing it from being released earlier, and if you can’t get callee side guaranteed ownership (like the borrow checker provides) then you have to retain/release anyway.

- does this concept even matter if/when we can write an argument as “borrowed String”? I suppose this is a bit more general in that it should be usable as properties and other things that the (initial) ownership design isn’t scoped to support since it doesn’t have lifetimes.

Right, it’s a little more general.

- array has exactly the same sort of concern, why is this a string-specific problem?

I think it’s interesting as slices may emerge more often in practice on strings than arrays, e.g. for presenting parsing results. Also, applications that care about this level of performance may custom-allocate all their string data contiguously (e.g. llvm::BumpPtr). I believe that projects such as llbuild have done this, forcing themselves into an API schism between String and UnsafeBufferPointer<UInt8>, often duplicating functionality. In theory all of this could also apply to array, but it seems to happen more for strings.

I see String and Array as directly analogous in the space. The *only* difference IMO is when you get to algorithms that interpret the bytes because of frickin’ unicode.

- how does this work with mutation? Wouldn’t we really have to have something like MutableArrayRef since it’s talking about the mutability of the elements, not something that var/let can support?

Good point. If this is more of a “UnsafeString”-like concept, there’s analogy with Unsafe[Mutable]BufferPointer.
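A minimal sketch of that analogy - a trivial, non-owning view over UTF-8 bytes in the spirit of llvm::StringRef and Unsafe[Mutable]BufferPointer; the name and API are placeholders, not a proposed design:

struct UnsafeStringView {
    // The caller guarantees these UTF-8 bytes outlive every use of the view.
    let utf8: UnsafeBufferPointer<UInt8>

    var count: Int { utf8.count }

    // Copying a view copies a pointer and a length: no allocation, no ARC.
    func hasPrefix(_ prefix: UnsafeStringView) -> Bool {
        guard prefix.count <= count else { return false }
        return utf8.prefix(prefix.count).elementsEqual(prefix.utf8)
    }
}

Whether something like this should be a separate type or a projection usable where String is expected is exactly the tradeoff being debated here.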

When I think about this, it seems like a “non-owning *slice*” of some sort. If you went that direction, then you wouldn’t have to build it into the String currency type, just like it isn’t in LLVM.

Yes, though you would lose the ability to give them to APIs expecting a String when you know it’s safe to do so. E.g., when the contents are effectively immortal relative to the API call and effects. Careful use of “unsafe” or “unmanaged” in type names and argument labels needed here.

Yes, I think that your point about Unsafe is important, which is even more reason not to wedge it into String. I really do understand the tradeoff of how you can’t pass a “Ref” to something that takes a String, but we already extensively debated that topic and accepted it w.r.t. Slice types. This seems directly analogous.

* Unification of String and Substring’s various Views.
* Some form of opaque string, such as an existential, which can be used as an extension point.
* More compact/performant indices.

What is the current theory on eager vs lazy bridging with Objective-C? Can we get to an eager bridging model? That alone would dramatically simplify and speed up every operation on a Swift string.

I don’t think that will be feasible in general, although perhaps it could happen for most common occurrences.

Bridging is either “always eager” or “possibly lazy”, there is nothing usefully in between. Lazy bridging means that every operation down to the lowly ascii parsing operations needs to check for the slow path. I understand that this may not be possible to eliminate on apple platforms, but it would be great to not add the existential generalization since that would affect other platforms that don’t have the legacy api concern.

  func hexToInt(_ c : UInt8) -> UInt8 {
    switch c {
    case charToByte("0")...charToByte("9"): return c - charToByte("0")
    case charToByte("a")...charToByte("f"): return c - charToByte("a") + 10
    case charToByte("A")...charToByte("F"): return c - charToByte("A") + 10
    default: fatalError("invalid hexadecimal character")
    }
  }

Yeah, trying to do ASCII value ranges is currently pretty rough. Here’s what I did recently when I wanted something similar, using unicode scalar literal ranges:

We resisted adding single quoted strings for characters early on because we felt that it was a marginal case and we might want to use single quotes for additional quoting characters (the way scripting languages like Perl allow either single or double quotes) or for multi-line strings. However, now that multiline strings are settled and now that it is clear that we’re not going to add multiple redundant ways to quote strings, I think it is worth considering whether we should introduce ExpressibleByCharacterLiteral types, whether UInt8 should be one (along with Character of course), and whether to tie it to single quotes.
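Purely as a sketch of the shape being floated here - no such protocol exists today, and the name and single-quote syntax are hypothetical:

protocol ExpressibleByCharacterLiteral {
    associatedtype CharacterLiteralType
    init(characterLiteral value: CharacterLiteralType)
}

// Under this sketch, `let zero: UInt8 = '0'` would be sugar for
// `UInt8(ascii: "0")`, and a non-ASCII literal would be rejected at
// compile time when the target type is UInt8.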

-Chris

···

On Jan 11, 2018, at 12:32 PM, Michael Ilseman <milseman@apple.com> wrote:

Hi Chris!

+CC Michael Gottesman, as I veer into talking about ARC.

(A gist-formatted version of this email can be found at
https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f\)

I’m very very excited for this, thank you for the detailed writeup and
consideration of the effects and tradeoffs involved.

Given that ordering is not fit for human consumption, but rather machine
processing, it might as well be fast. The current ordering differs on
Darwin and Linux platforms, with Linux in particular suffering from poor
performance due to choice of ordering (UCA with DUCET) and older versions
of ICU. Instead, [String Comparison Prototype](
https://github.com/apple/swift/pull/12115\) provides a simpler ordering
that allows for many common-case fast-paths and optimizations. For all the
Unicode enthusiasts out there, this is the lexicographical ordering of
NFC-normalized UTF-16 code units.

Thank you for fixing this. Your tradeoffs make perfect sense to me.

### Small String Optimization

..

For example, assuming a 16-byte String struct and 8 bits used for flags
and discriminators (including discriminators for which small form a String
is in), 120 bits are available for a small string payload. 120 bits can
hold 7 UTF-16 code units, which is sufficient for most graphemes and many
common words and separators. 120 bits can also fit 15 ASCII/UTF-8 code
units without any packing, which suffices for many programmer/system
strings (which have a strong skew towards ASCII).

We may also want a compact 5-bit encoding for formatted numbers, such as
64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in
base-10, which would require 18, 19, and 24 characters respectively. 120
bits with a 5-bit encoding would fit all of these. This would speed up the
creation and interpolation of many strings containing numbers.

I think it is important to consider that having more special cases and
different representations slows down nearly *every* operation on string
because they have to check and detangle all of the possible representations.

nit: Multiple small representations would slow down nearly every operation
on *small* strings. That is, we would initially branch on a isSmall check,
and then small strings would pay the cost of further inspection. This could
also slow down String construction, depending on how aggressively we try to
pack/transcode and whether we also have specialized inits.

Given the ability to hold 15 digits of ascii, I don’t see why it would be
worthwhile to worry about a 5-bit representation for digits. String should
not be an Any!

The tradeoff here is that you’d be biasing the design to favor creation
and interpolation of many strings containing *large* numbers, at the cost
of general string performance anywhere. This doesn’t sound like a good
tradeoff for me, particularly when people writing extremely performance
sensitive code probably won’t find it good enough anyway.

Final details are still in exploration. If the construction and
interpretation of small bit patterns can remain behind a resilience
barrier, new forms could be added in the future. However, the performance
impact of this would need to be carefully evaluated.

I’d love to see performance numbers on this. Have you considered a design
where you have exactly one small string representation: a sequence of 15
UTF8 bytes? This holds your 15 bytes of ascii, probably more non-ascii
characters on average than 7 UTF16 codepoints, and only needs one
determinator branch on the entry point of hot functions that want to touch
the bytes. If you have a lot of algorithms that are sped up by knowing
they have ascii, you could go with two bits: “is small” and “isascii”,
where isascii is set in both the small string and normal string cases.

We have not yet done the investigation, so all details could change. This
is my (perhaps flawed!) reasoning as to why, in general, it may be useful
to have multiple small forms. Again, this will change as we learn more.

Strings are created and copied (i.e. a value-copy/retain) more often than
they are read (and strings are read far more often than they are modified).
The “high-order bit” for performance is whether a string is small-or-not,
as avoiding an allocation and, just as importantly, managing the memory
with ARC is skipped. Beyond that, we’re debating smaller performance
deltas, which are still important. This reasoning also influences where we
choose to stick our resilience barriers, which can be relaxed in the future.

One of the nice things about having a small representation for each of our
backing storage kinds (ASCII/UTF-8 vs UTF-16) is that it would gives us a
very fast way of conservatively checking whether we form a small string.
E.g., if we’re forming a Substring over a portion of a UTF-16-backed
string, we can check whether the range holds 7 or fewer code units to avoid
the ARC. I don’t know if it would be worth the attempt to detect whether it
would fit in 15 transcoded UTF-8 code units vs bumping the ref count.
Avoiding the extra packing attempt may help branch mis-predicts, though I’m
*way* out of my depth when it comes to reasoning about mis-predicts as they
emerge in the wild. There is a “temporal locality of Unicode-ness” in
string processing, though.

As far as what strings a 7 code unit UTF-16 small form that couldn’t fit
in 15 code units of UTF-8, the biggest cases are latter-BMP scalars where a
small UTF-8 string could only store 5. This may not be a big deal.

As to a 5-bit encoding, again, all details pending more experimentation.
Its importance may be diminished by more efficient, low-level String
construction interfaces. The difference between 15 and 24 characters could
be big, considering that formatted addresses and Doubles fit in that range.
We’ll see.

Finally, what tradeoffs do you see between a 1-word vs 2-word string? Are
we really destined to have 2-words? That’s still much better than the 3
words we have now, but for some workloads it is a significant bloat.

<repeat disclaimer about final details being down to real data>. Some
arguments in favor of 2-word, presented roughly in order of impact:

1. This allows the String type to accommodate llvm::StringRef-style
usages. This is pretty broad usage: “mmap a file and treat its contents as
a String”, “store all my contents in an llvm::BumpPtr which outlives uses”,
un-owned slices, etc. One word String would greatly limit this to only
whole-string nul-terminated cases.

2. Two-word String fits more small strings. Exactly where along the
diminishing-returns curve 7 vs 15 UTF-8 code units lie is dependent on the
data set. One example is NSString, which (according to reasoning at
https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html\)
considered it important enough to have 6- and 5- bit reduced ASCII
character sets to squeeze up to 11-length strings in a word. 15 code unit
small strings would be a super-set of tagged NSStrings, meaning we could
bridge them eagerly in-line, while 7 code unit small strings would be a
subset (and also a strong argument against eagerly bridging them).

If you have access to any interesting data sets and can report back some
statistics, that would be immensely helpful!

3. More bits available to reserve for future-proofing, etc., though many
of these could be stored in the header.

4. The second word can cache useful information from large strings.
`endIndex` is a very frequently requested computed property and it could be
stored directly in-line rather than loaded from memory (though perhaps a
load happens anyways in a subsequent read of the string). Alternatively, we
could store the grapheme count or some other piece of information that we’d
otherwise have to recompute. More experimentation needed here.

5. (vague and hand-wavy) Two-words fits into a nice groove that 3-words
doesn’t: 2 words is a rule-of-thumb size for very small buffers. It’s a
common heap alignment, stack alignment, vector-width, double-word-load
width, etc.. 1-word Strings may be under-utilizing available resources,
that is the second word will often be there for use anyways. The main case
where this is not true and 1-word shines is aggregates of String.

I’m seeking some kind of ruthlessly-pragmatic balance here, and I think 2
word is currently winning. Again, up to further investigation and debate.
FWIW, I believe Rust’s String is 3-words, i.e. it’s the std::vector-style
pointer+length+capacity, but I haven’t looked into their String vs OsString
vs OsStr vs … model.

### Unmanaged Strings

Such unmanaged strings can be used when the lifetime of the storage is
known to outlive all uses.

Just like StringRef! +1, this concept can be a huge performance win… but
can also be a huge source of UB if used wrong.. :-(

When it comes to naming, I’m more prone to argue for “UnsafeString” rather
than “UnmanagedString”, but that’s a later debate for SE ;-)

As such, they don’t need to participate in reference counting. In the
future, perhaps with ownership annotations or sophisticated static
analysis, Swift could convert managed strings into unmanaged ones as an
optimization when it knows the contents would not escape. Similarly in the
future, a String constructed from a Substring that is known to not outlive
the Substring’s owner can use an unmanaged form to skip copying the
contents. This would benefit String’s status as the recommended
common-currency type for API design.

This could also have implications for StaticString.

Yup!

Some kind of unmanaged string construct is an often-requested feature for
highly performance-sensitive code. We haven’t thought too deeply about how
best to surface this as API and it can be sliced and diced many ways. Some
considerations:

Other questions/considerations:
- here and now, could we get the vast majority of this benefit by having
the ABI pass string arguments as +0 and guarantee their lifetime across
calls? What is the tradeoff of that?

Michael Gottesman (+CC) is doing this sort of investigation right now,
evaluating the tradeoffs of more wide-spread use of the +0 convention. My
current understanding is that even with +0, many retain/release pairs
cannot be eliminated without more knowledge of the callee, but I don’t know
the reasons why. Hopefully he can elaborate.

- does this concept even matter if/when we can write an argument as
“borrowed String”? I suppose this is a bit more general in that it should
be usable as properties and other things that the (initial) ownership
design isn’t scoped to support since it doesn’t have lifetimes.

Right, it’s a little more general.

- array has exactly the same sort of concern, why is this a
string-specific problem?

I think it’s interesting as slices may emerge more often in practice on
strings than arrays, e.g. for presenting parsing results. Also,
applications that care about this level of performance may custom-allocate
all their string data contiguously (e.g. llvm::BumpPtr). I believe that
projects such as llbuild have done this, forcing themselves into an API
schism between String and UnsafeBufferPointer<UInt8>, often duplicating
functionality. In theory all of this could also apply to array, but it
seems to happen more for strings.

- how does this work with mutation? Wouldn’t we really have to have
something like MutableArrayRef since its talking about the mutability of
the elements, not something that var/let can support?

Good point. If this is more of a “UnsafeString”-like concept, there’s
analogy with Unsafe[Mutable]BufferPointer.

When I think about this, it seems like an “non-owning *slice*” of some
sort. If you went that direction, then you wouldn’t have to build it into
the String currency type, just like it isn’t in LLVM.

Yes, though you would lose the ability to give them to APIs expecting a
String when you know it’s safe to do so. E.g., when the contents are
effectively immortal relative to the API call and effects. Careful use of
“unsafe” or “unmanaged” in type names and argument labels needed here.

### Performance Flags

Nice.

### Miscellaneous

There are many other tweaks and improvements possible under the new ABI
prior to stabilization, such as:

* Guaranteed nul-termination for String’s storage classes for faster C
bridging.

This is probably best as another perf flag.

I was thinking that if we always zero-fill our storage classes and reserve
one extra character, then the flag would be redundant with the
discriminator bits. However, that’s assuming the string doesn’t store a nul
character in the middle. I agree this might need a separate perf flag.

* Unification of String and Substring’s various Views.
* Some form of opaque string, such as an existential, which can be used as
an extension point.
* More compact/performant indices.

What is the current theory on eager vs lazy bridging with Objective-C?
Can we get to an eager bridging model? That alone would dramatically
simplify and speed up every operation on a Swift string.

I don’t think that will be feasible in general, although perhaps it could
happen for most common occurrences.

## String Ergonomics

I need to run now, but I’ll read and comment on this section when I have a
chance.

That said, just today I had to write the code below and the ‘charToByte’
part of it is absolutely tragic. Is there any thoughts about how to
improve character literal processing?

-Chris

func decodeHex(_ string: String) -> [UInt8] {
  func charToByte(_ c: String) -> UInt8 {
    return c.utf8.first! // swift makes char processing grotty. :-(
  }

  func hexToInt(_ c : UInt8) -> UInt8 {
    switch c {
    case charToByte("0")...charToByte("9"): return c - charToByte("0")
    case charToByte("a")...charToByte("f"): return c - charToByte("a") +
10
    case charToByte("A")...charToByte("F"): return c - charToByte("A") +
10
    default: fatalError("invalid hexadecimal character")
    }
  }

Yeah, trying to do ASCII value ranges is currently pretty rough. Here’s
what I did recently when I wanted something similar, using unicode scalar
literal ranges:

extension Character {
    public var isASCII: Bool {
        guard let scalar = unicodeScalars.first, unicodeScalars.count == 1 else { return false }
        return scalar.value <= 0x7f
    }
    public var isASCIIAlphaNumeric: Bool {
        guard isASCII else { return false }
        switch unicodeScalars.first! {
        case "0"..."9": return true
        case "A"..."Z": return true
        case "a"..."z": return true
        default: return false
        }
    }
}

+1 to UnicodeScalars. I've found that in the cases where you can't
guarantee/don't know whether the underlying string storage is ASCII or
UTF-16, if performance is a concern, accessing the scalars gives you the
lowest cost conversions (ASCII -> Scalar is a simple promotion always, and
UTF-16 -> Scalar is a simple promotion in all but the relatively
small cases where you have a surrogate pair). On the other hand,
transcoding to UTF-8 can be more costly if all you want to do is compare
the Unicode code points numerically anyway.

Using UnicodeScalars here also has the advantage of looking cleaner in the
source code (like those ranges above) thanks to
ExpressibleByUnicodeScalarLiteral, which UInt8 doesn't have.

For a more general solution, I think a `var numericValue: Int? { get }` on
Character would make sense. Unicode defines (at least one) semantics for
this and ICU provides this functionality already.

I threw together a small library recently that adds extensions to
UnicodeScalar for most of the properties exposed by ICU and I would *love*
to see these make become first-class:
https://github.com/allevato/icu-swift/tree/master/Sources/ICU\. In other
words I'm happy to turn bits of this into proposals and refactor into
implementation PRs :)

···

On Thu, Jan 11, 2018 at 12:32 PM Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:

On Jan 10, 2018, at 9:29 PM, Chris Lattner <clattner@nondot.org> wrote:
On Jan 10, 2018, at 11:55 AM, Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:

  var result: [UInt8] = []

  assert(string.count & 1 == 0, "must get a pair of hexadecimal characters")
  var it = string.utf8.makeIterator()
  while let byte1 = it.next(),
        let byte2 = it.next() {
    result.append((hexToInt(byte1) << 4) | hexToInt(byte2))
  }

  return result
}

print(decodeHex("01FF"))
print(decodeHex("8001"))
print(decodeHex("80a1bcd3"))


I wouldn’t overly rely on it for guidance on these issues given that it is stuck so squarely in the realm of UTF-16.

Wading a little into the weeds here, CharacterSet’s builtins model older Unicode semantics than what we probably want to provide. E.g. CharacterSet.lowercaseLetters means general category “Ll”, while modern Unicode defines property Lowercase which includes more scalars. The Lowercase property vs CharacterSet.lowercaseLetters is equivalent to ICU’s u_isULowercase [1] vs u_islower [2], where the former is the modern recommended API. Contrarily, Perl 6’s <lower> built-in rule[3] seems to also just be “Ll” and I’m not familiar with the rationale there (compatibility?). Certainly needs investigation.
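A concrete example of the gap, using Swift 5’s `Unicode.Scalar.Properties` (which did not exist when this thread was written) to stand in for u_isULowercase:

import Foundation

let ordinal: Unicode.Scalar = "\u{00AA}"  // ª FEMININE ORDINAL INDICATOR, general category Lo
print(ordinal.properties.isLowercase)                    // true  (modern Lowercase property)
print(CharacterSet.lowercaseLetters.contains(ordinal))   // false (general category Ll only)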

Makes sense. It would be great to standardize towards the grapheme cluster as the canonical concept of character (this is why Character is defined that way) and by pushing regex’s to use that definition of character, we can simplify and reduce aggregate complexity. This is a definition that “just works” most often, and the introduction of regex’s will eliminate all of the awkwardness of having to use them - by eliminating some of the most common reasons that people want to use integer indexes into strings.

I supposed by “guidance” I meant “here are some of the things people commonly ask about characters”. Built-in character classes in various regex flavors also provide guidance.

[1] u_isULowercase: ICU API documentation
[2] u_islower: ICU API documentation
[3] Raku/old-design-docs: Raku language design documents (https://github.com/Raku/old-design-docs)

One of the frustrating things about the regex state of the art is that most engines were built pre-modern-unicode-complexity. As you say, it is surprising that Perl6 didn’t fix some of this, but it could either be because of compatibility or (more likely) that they made many of these decisions 15 years ago when Perl 6 was just getting going. Perl 6 had a long gestation period.

Note: In no way would these properties obviate the need for CharacterSet, as CharacterSet API is independently useful and the right model for many whole-string operations.

No it isn’t. A Grapheme-cluster based analog of CharacterSet would be a reasonable model to consider though. It could conceptually support predicates like isEmoji()

In the regex section I talk about character classes as being modeled by `(Character) -> Bool` rather than something that conforms to SetAlgebra. One big reason (beyond being awesomely convenient) is because I don’t think graphemes are something to be enumerated.

+1

Theoretically speaking, I’m not sure if the set of graphemes is even recursively enumerable (e.g. zalgo-text or zalgo-emoji). Practically speaking, it darn-well might as well be uncountable. Best case, if even possible, any enumeration would be emergent behavior of encoding details and change significantly with each Unicode version. If we cannot enumerate graphemes, we cannot answer equality, isDisjoint, isSubset, etc.

+2

I don’t know if those operations are important for a grapheme analogue. They just demonstrate that a grapheme set follows a different model than a scalar set, and different than what we have traditionally used the word “Set” for in Swift, though you would know the history here better.

+3. I completely agree that a “set” in the computer sciency usage is no longer a useful notion for characters. It is still technically correct with the mathematical definition, but any alignment of terminology towards this will just be confusing to users.

As far as uses of such a grapheme set are concerned, they seem equivalent to application of a regex consisting of a character class. For example, rangeOfCharacter(from:options:) could be accomplished with the below.

Yes, that’s a great way to consider it.

In an ever-futile attempt to avoid focusing on specific syntax, I’ll form an army of straw-people, using 「」 delimiters for a regex literal, « » to denote character classes, subscript on String to model application of a Regex, parenthesis to denote a capture that must always be a decl (even if the name is dropped by using `_`), and Regex literals defaulting to whole-string matching rather than the first partial-string match:

extension Character {
    var isEmoji: Bool { get }
}
extension Regex {
    var rangeOfMatch: Regex<(Range<String.Index>)> // Drops the captured value, just wants the range
    var allMatches: Regex<LazySequence<T>>
}

let emojiPattern =「(let _ = «.isEmoji»)」 // Regex<(Character)>

I assume that we will support an escape sequence of some sort and use the \ character for it. Given that, it makes sense to use \(.isEmoji) as the syntax for referring to user-defined predicates, given that we use it for string literal interpolation, and that regex’s are the dual of it.

let theFirst = myString[emojiPattern.rangeOfMatch.firstPartial] // Range<String.Index>
let allOfThem = myString[emojiPattern.rangeOfMatch.allMatches] // LazySequence<Range<String.Index>>

(Alternatively, Regex could have an init taking a (Character) -> Bool)

In such a world, what do you image the role of a grapheme set type is? It might have more uses as provided by the platform, alongside things such as localization, linguistic analysis, etc.

I think you’re arguing here that CharacterSet should go away. If that is possible, then that would be great. I’d check to see how CharacterSet is used across Cocoa, I’m not an expert on that. It would be fantastic if those could be replaced with versions that take predicate functions or regex’s.

* Interpolated expressions are supplied inside the literal, meaning they cannot be passed around like format strings without extra boilerplate (e.g. a wrapping function).

The original (2013 era) idea was to allow for interpolants to take multiple arguments (the contents of \(….) would be passed as the argument list, so you could specify formatting information as subsequent parameters, e.g.:

    “your file access is set to \(mode, .Octal)”.

or:
    “your file access is set to \(mode, format: .Octal)”.
  
or something like that. Of course each type would define its own formatting modes, and this would probably be reflected through a type-specific initializer.
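For what it’s worth, the Swift 5 interpolation design (SE-0228, which postdates this thread) can express exactly this shape. A hedged sketch, with the `format:` label and the `IntegerFormat` cases invented for illustration:

extension String.StringInterpolation {
    enum IntegerFormat { case decimal, octal, hex }
    mutating func appendInterpolation(_ value: Int, format: IntegerFormat) {
        switch format {
        case .decimal: appendLiteral(String(value))
        case .octal:   appendLiteral(String(value, radix: 8))
        case .hex:     appendLiteral(String(value, radix: 16))
        }
    }
}

let mode = 0o644
print("your file access is set to \(mode, format: .octal)")  // your file access is set to 644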

That looks very similar to one of Brent’s proposals (that I can no longer find a link to). Was there a reason it didn’t happen other than not getting around to it?

It simply wasn’t the priority at the time (e.g. we didn’t even have let vs var yet :-). There were bigger fish to fry and we didn’t come back to it.

### Regex: Tearing Strings Apart

This is clearly necessary but also clearly not part of Swift 5. Regex’s should themselves be the subject of an intense design process. Swift 6 pretty please??? I’d suggest splitting this section out and starting a regex manifesto. :-)

Since it is so big, I feel like if we’re going to make any progress we need to start soon. Of course, any focus will be preempted as needed by ABI stability. Even if Swift N doesn’t get the fully formed feature, we’ll pick up things along the way. E.g. as you mention there’s a parity between Character’s Bool properties and character classes.

I definitely want to circulate and discuss these ideas a little bit before writing a “manifesto”.

K, I’m just saying that this is meaty enough and should continue to develop, so I think that it being a manifesto is an inevitability. :-) I’m thrilled you’re pushing this forward btw.

-Chris

···

On Jan 13, 2018, at 10:30 AM, Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:

Hi Chris!

+CC Michael Gottesman, as I veer into talking about ARC.

(A gist-formatted version of this email can be found at https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f\)

I’m very very excited for this, thank you for the detailed writeup and consideration of the effects and tradeoffs involved.

Given that ordering is not fit for human consumption, but rather machine processing, it might as well be fast. The current ordering differs on Darwin and Linux platforms, with Linux in particular suffering from poor performance due to choice of ordering (UCA with DUCET) and older versions of ICU. Instead, [String Comparison Prototype](https://github.com/apple/swift/pull/12115\) provides a simpler ordering that allows for many common-case fast-paths and optimizations. For all the Unicode enthusiasts out there, this is the lexicographical ordering of NFC-normalized UTF-16 code units.

Thank you for fixing this. Your tradeoffs make perfect sense to me.

### Small String Optimization

..

For example, assuming a 16-byte String struct and 8 bits used for flags and discriminators (including discriminators for which small form a String is in), 120 bits are available for a small string payload. 120 bits can hold 7 UTF-16 code units, which is sufficient for most graphemes and many common words and separators. 120 bits can also fit 15 ASCII/UTF-8 code units without any packing, which suffices for many programmer/system strings (which have a strong skew towards ASCII).

We may also want a compact 5-bit encoding for formatted numbers, such as 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in base-10, which would require 18, 19, and 24 characters respectively. 120 bits with a 5-bit encoding would fit all of these. This would speed up the creation and interpolation of many strings containing numbers.

I think it is important to consider that having more special cases and different representations slows down nearly *every* operation on string because they have to check and detangle all of the possible representations.

nit: Multiple small representations would slow down nearly every operation on *small* strings. That is, we would initially branch on a isSmall check, and then small strings would pay the cost of further inspection. This could also slow down String construction, depending on how aggressively we try to pack/transcode and whether we also have specialized inits.

Given the ability to hold 15 digits of ascii, I don’t see why it would be worthwhile to worry about a 5-bit representation for digits. String should be an Any!

The tradeoff here is that you’d be biasing the design to favor creation and interpolation of many strings containing *large* numbers, at the cost of general string performance anywhere. This doesn’t sound like a good tradeoff for me, particularly when people writing extremely performance sensitive code probably won’t find it good enough anyway.

Final details are still in exploration. If the construction and interpretation of small bit patterns can remain behind a resilience barrier, new forms could be added in the future. However, the performance impact of this would need to be carefully evaluated.

I’d love to see performance numbers on this. Have you considered a design where you have exactly one small string representation: a sequence of 15 UTF8 bytes? This holds your 15 bytes of ascii, probably more non-ascii characters on average than 7 UTF16 codepoints, and only needs one determinator branch on the entry point of hot functions that want to touch the bytes. If you have a lot of algorithms that are sped up by knowing they have ascii, you could go with two bits: “is small” and “isascii”, where isascii is set in both the small string and normal string cases.

We have not yet done the investigation, so all details could change. This is my (perhaps flawed!) reasoning as to why, in general, it may be useful to have multiple small forms. Again, this will change as we learn more.

Strings are created and copied (i.e. a value-copy/retain) more often than they are read (and strings are read far more often than they are modified). The “high-order bit” for performance is whether a string is small-or-not, as avoiding an allocation and, just as importantly, managing the memory with ARC is skipped. Beyond that, we’re debating smaller performance deltas, which are still important. This reasoning also influences where we choose to stick our resilience barriers, which can be relaxed in the future.

One of the nice things about having a small representation for each of our backing storage kinds (ASCII/UTF-8 vs UTF-16) is that it would gives us a very fast way of conservatively checking whether we form a small string. E.g., if we’re forming a Substring over a portion of a UTF-16-backed string, we can check whether the range holds 7 or fewer code units to avoid the ARC. I don’t know if it would be worth the attempt to detect whether it would fit in 15 transcoded UTF-8 code units vs bumping the ref count. Avoiding the extra packing attempt may help branch mis-predicts, though I’m *way* out of my depth when it comes to reasoning about mis-predicts as they emerge in the wild. There is a “temporal locality of Unicode-ness” in string processing, though.

As far as what strings a 7 code unit UTF-16 small form that couldn’t fit in 15 code units of UTF-8, the biggest cases are latter-BMP scalars where a small UTF-8 string could only store 5. This may not be a big deal.

As to a 5-bit encoding, again, all details pending more experimentation. Its importance may be diminished by more efficient, low-level String construction interfaces. The difference between 15 and 24 characters could be big, considering that formatted addresses and Doubles fit in that range. We’ll see.

Finally, what tradeoffs do you see between a 1-word vs 2-word string? Are we really destined to have 2-words? That’s still much better than the 3 words we have now, but for some workloads it is a significant bloat.

<repeat disclaimer about final details being down to real data>. Some arguments in favor of 2-word, presented roughly in order of impact:

1. This allows the String type to accommodate llvm::StringRef-style usages. This is pretty broad usage: “mmap a file and treat its contents as a String”, “store all my contents in an llvm::BumpPtr which outlives uses”, un-owned slices, etc. One word String would greatly limit this to only whole-string nul-terminated cases.

2. Two-word String fits more small strings. Exactly where along the diminishing-returns curve 7 vs 15 UTF-8 code units lie is dependent on the data set. One example is NSString, which (according to reasoning at https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html\) considered it important enough to have 6- and 5- bit reduced ASCII character sets to squeeze up to 11-length strings in a word. 15 code unit small strings would be a super-set of tagged NSStrings, meaning we could bridge them eagerly in-line, while 7 code unit small strings would be a subset (and also a strong argument against eagerly bridging them).

If you have access to any interesting data sets and can report back some statistics, that would be immensely helpful!

3. More bits available to reserve for future-proofing, etc., though many of these could be stored in the header.

4. The second word can cache useful information from large strings. `endIndex` is a very frequently requested computed property and it could be stored directly in-line rather than loaded from memory (though perhaps a load happens anyways in a subsequent read of the string). Alternatively, we could store the grapheme count or some other piece of information that we’d otherwise have to recompute. More experimentation needed here.

5. (vague and hand-wavy) Two-words fits into a nice groove that 3-words doesn’t: 2 words is a rule-of-thumb size for very small buffers. It’s a common heap alignment, stack alignment, vector-width, double-word-load width, etc.. 1-word Strings may be under-utilizing available resources, that is the second word will often be there for use anyways. The main case where this is not true and 1-word shines is aggregates of String.

I’m seeking some kind of ruthlessly-pragmatic balance here, and I think 2 word is currently winning. Again, up to further investigation and debate. FWIW, I believe Rust’s String is 3-words, i.e. it’s the std::vector-style pointer+length+capacity, but I haven’t looked into their String vs OsString vs OsStr vs … model.

### Unmanaged Strings

Such unmanaged strings can be used when the lifetime of the storage is known to outlive all uses.

Just like StringRef! +1, this concept can be a huge performance win… but can also be a huge source of UB if used wrong.. :-(

When it comes to naming, I’m more prone to argue for “UnsafeString” rather than “UnmanagedString”, but that’s a later debate for SE ;-)

As such, they don’t need to participate in reference counting. In the future, perhaps with ownership annotations or sophisticated static analysis, Swift could convert managed strings into unmanaged ones as an optimization when it knows the contents would not escape. Similarly in the future, a String constructed from a Substring that is known to not outlive the Substring’s owner can use an unmanaged form to skip copying the contents. This would benefit String’s status as the recommended common-currency type for API design.

This could also have implications for StaticString.

Yup!

Some kind of unmanaged string construct is an often-requested feature for highly performance-sensitive code. We haven’t thought too deeply about how best to surface this as API and it can be sliced and diced many ways. Some considerations:

Other questions/considerations:
- here and now, could we get the vast majority of this benefit by having the ABI pass string arguments as +0 and guarantee their lifetime across calls? What is the tradeoff of that?

Michael Gottesman (+CC) is doing this sort of investigation right now, evaluating the tradeoffs of more wide-spread use of the +0 convention. My current understanding is that even with +0, many retain/release pairs cannot be eliminated without more knowledge of the callee, but I don’t know the reasons why. Hopefully he can elaborate.

- does this concept even matter if/when we can write an argument as “borrowed String”? I suppose this is a bit more general in that it should be usable as properties and other things that the (initial) ownership design isn’t scoped to support since it doesn’t have lifetimes.

Right, it’s a little more general.

- array has exactly the same sort of concern, why is this a string-specific problem?

I think it’s interesting as slices may emerge more often in practice on strings than arrays, e.g. for presenting parsing results. Also, applications that care about this level of performance may custom-allocate all their string data contiguously (e.g. llvm::BumpPtr). I believe that projects such as llbuild have done this, forcing themselves into an API schism between String and UnsafeBufferPointer<UInt8>, often duplicating functionality. In theory all of this could also apply to array, but it seems to happen more for strings.

- how does this work with mutation? Wouldn’t we really have to have something like MutableArrayRef since it’s talking about the mutability of the elements, not something that var/let can support?

Good point. If this is more of a “UnsafeString”-like concept, there’s analogy with Unsafe[Mutable]BufferPointer.

When I think about this, it seems like a “non-owning *slice*” of some sort. If you went that direction, then you wouldn’t have to build it into the String currency type, just like it isn’t in LLVM.

Yes, though you would lose the ability to give them to APIs expecting a String when you know it’s safe to do so. E.g., when the contents are effectively immortal relative to the API call and effects. Careful use of “unsafe” or “unmanaged” in type names and argument labels needed here.

### Performance Flags

Nice.

### Miscellaneous

There are many other tweaks and improvements possible under the new ABI prior to stabilization, such as:

* Guaranteed nul-termination for String’s storage classes for faster C bridging.

This is probably best as another perf flag.

I was thinking that if we always zero-fill our storage classes and reserve one extra character, then the flag would be redundant with the discriminator bits. However, that’s assuming the string doesn’t store a nul character in the middle. I agree this might need a separate perf flag.

* Unification of String and Substring’s various Views.
* Some form of opaque string, such as an existential, which can be used as an extension point.
* More compact/performant indices.

What is the current theory on eager vs lazy bridging with Objective-C? Can we get to an eager bridging model? That alone would dramatically simplify and speed up every operation on a Swift string.

I don’t think that will be feasible in general, although perhaps it could happen for most common occurrences.

## String Ergonomics

I need to run now, but I’ll read and comment on this section when I have a chance.

That said, just today I had to write the code below and the ‘charToByte’ part of it is absolutely tragic. Are there any thoughts about how to improve character literal processing?

-Chris

func decodeHex(_ string: String) -> [UInt8] {
  func charToByte(_ c: String) -> UInt8 {
    return c.utf8.first! // swift makes char processing grotty. :-(
  }

  func hexToInt(_ c : UInt8) -> UInt8 {
    switch c {
    case charToByte("0")...charToByte("9"): return c - charToByte("0")
    case charToByte("a")...charToByte("f"): return c - charToByte("a") + 10
    case charToByte("A")...charToByte("F"): return c - charToByte("A") + 10
    default: fatalError("invalid hexadecimal character")
    }
  }

Yeah, trying to do ASCII value ranges is currently pretty rough. Here’s what I did recently when I wanted something similar, using unicode scalar literal ranges:

extension Character {
    public var isASCII: Bool {
        guard let scalar = unicodeScalars.first, unicodeScalars.count == 1 else { return false }
        return scalar.value <= 0x7f
    }
    public var isASCIIAlphaNumeric: Bool {
        guard isASCII else { return false }
        switch unicodeScalars.first! {
        case "0"..."9": return true
        case "A"..."Z": return true
        case "a"..."z": return true
        default: return false
        }
    }
}

+1 to UnicodeScalars. I've found that in the cases where you can't guarantee/don't know whether the underlying string storage is ASCII or UTF-16, if performance is a concern, accessing the scalars gives you the lowest-cost conversions (ASCII -> Scalar is always a simple promotion, and UTF-16 -> Scalar is a simple promotion in all but the relatively rare cases where you have a surrogate pair). On the other hand, transcoding to UTF-8 can be more costly if all you want to do is compare the Unicode code points numerically anyway.

Using UnicodeScalars here also has the advantage of looking cleaner in the source code (like those ranges above) thanks to ExpressibleByUnicodeScalarLiteral, which UInt8 doesn't have.

For a more general solution, I think a `var numericValue: Int? { get }` on Character would make sense. Unicode defines (at least one) semantics for this and ICU provides this functionality already.

I threw together a small library recently that adds extensions to UnicodeScalar for most of the properties exposed by ICU and I would *love* to see these become first-class: https://github.com/allevato/icu-swift/tree/master/Sources/ICU. In other words I'm happy to turn bits of this into proposals and refactor into implementation PRs :)

That looks awesome!

I think a few general APIs for querying categories/properties would satisfy most of the expert-level needs, and it would be highly valuable to provide frequently used (and easily explained) properties. For the stdlib, these may not be as exhaustive as your package, balancing usefulness and discoverability against API bloat. There would always be a place for an exhaustive/convenience package for experts and enthusiasts.

···

On Jan 11, 2018, at 2:06 PM, Tony Allevato <tony.allevato@gmail.com> wrote:
On Thu, Jan 11, 2018 at 12:32 PM Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:

On Jan 10, 2018, at 9:29 PM, Chris Lattner <clattner@nondot.org> wrote:
On Jan 10, 2018, at 11:55 AM, Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:

  var result: [UInt8] = []

  assert(string.count & 1 == 0, "must get a pair of hexadecimal characters")
  var it = string.utf8.makeIterator()
  while let byte1 = it.next(),
        let byte2 = it.next() {
    result.append((hexToInt(byte1) << 4) | hexToInt(byte2))
  }

  return result
}

print(decodeHex("01FF"))
print(decodeHex("8001"))
print(decodeHex("80a1bcd3"))
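(Editorial aside: for comparison, here is a hedged sketch of the same decoder written against Unicode scalar literal ranges, the approach suggested above. decodeHexSketch and nibble are illustrative names, not proposed API; it returns nil on malformed input rather than trapping.)

func decodeHexSketch(_ string: String) -> [UInt8]? {
  // Convert one hex-digit scalar to its value, using scalar literal ranges.
  func nibble(_ scalar: Unicode.Scalar) -> UInt8? {
    switch scalar {
    case "0"..."9": return UInt8(scalar.value &- 0x30)        // '0' is U+0030
    case "a"..."f": return UInt8(scalar.value &- 0x61 &+ 10)  // 'a' is U+0061
    case "A"..."F": return UInt8(scalar.value &- 0x41 &+ 10)  // 'A' is U+0041
    default: return nil
    }
  }
  var result: [UInt8] = []
  var scalars = string.unicodeScalars.makeIterator()
  while let high = scalars.next() {
    guard let low = scalars.next(),                           // digits must come in pairs
          let hi = nibble(high), let lo = nibble(low) else { return nil }
    result.append(hi << 4 | lo)
  }
  return result
}

// decodeHexSketch("80a1bcd3") == [128, 161, 188, 211]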


Hi Chris!

+CC Michael Gottesman, as I veer into talking about ARC.

(A gist-formatted version of this email can be found at https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f\)

I’m very very excited for this, thank you for the detailed writeup and consideration of the effects and tradeoffs involved.

Given that ordering is not fit for human consumption, but rather machine processing, it might as well be fast. The current ordering differs on Darwin and Linux platforms, with Linux in particular suffering from poor performance due to choice of ordering (UCA with DUCET) and older versions of ICU. Instead, [String Comparison Prototype](https://github.com/apple/swift/pull/12115\) provides a simpler ordering that allows for many common-case fast-paths and optimizations. For all the Unicode enthusiasts out there, this is the lexicographical ordering of NFC-normalized UTF-16 code units.

Thank you for fixing this. Your tradeoffs make perfect sense to me.

### Small String Optimization

..

For example, assuming a 16-byte String struct and 8 bits used for flags and discriminators (including discriminator bits recording which small form a String is in), 120 bits are available for a small string payload. 120 bits can hold 7 UTF-16 code units, which is sufficient for most graphemes and many common words and separators. 120 bits can also fit 15 ASCII/UTF-8 code units without any packing, which suffices for many programmer/system strings (which have a strong skew towards ASCII).

We may also want a compact 5-bit encoding for formatted numbers, such as 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in base-10, which would require 18, 19, and 24 characters respectively. 120 bits with a 5-bit encoding would fit all of these. This would speed up the creation and interpolation of many strings containing numbers.
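(Editorial aside: spelling out the capacity arithmetic above.)

// Editorial illustration of the capacity math above.
let payloadBits = 128 - 8   // 16-byte struct minus 8 bits of flags/discriminators
print(payloadBits / 16)     // 7  UTF-16 code units
print(payloadBits / 8)      // 15 ASCII/UTF-8 code units
print(payloadBits / 5)      // 24 characters in a hypothetical 5-bit encoding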

I think it is important to consider that having more special cases and different representations slows down nearly *every* operation on string because they have to check and detangle all of the possible representations.

nit: Multiple small representations would slow down nearly every operation on *small* strings. That is, we would initially branch on an isSmall check, and then small strings would pay the cost of further inspection. This could also slow down String construction, depending on how aggressively we try to pack/transcode and whether we also have specialized inits.

Given the ability to hold 15 digits of ascii, I don’t see why it would be worthwhile to worry about a 5-bit representation for digits. String should not be an Any!

The tradeoff here is that you’d be biasing the design to favor creation and interpolation of many strings containing *large* numbers, at the cost of general string performance everywhere. This doesn’t sound like a good tradeoff to me, particularly when people writing extremely performance sensitive code probably won’t find it good enough anyway.

Final details are still in exploration. If the construction and interpretation of small bit patterns can remain behind a resilience barrier, new forms could be added in the future. However, the performance impact of this would need to be carefully evaluated.

I’d love to see performance numbers on this. Have you considered a design where you have exactly one small string representation: a sequence of 15 UTF8 bytes? This holds your 15 bytes of ascii, probably more non-ascii characters on average than 7 UTF16 codepoints, and only needs one discriminator branch on the entry point of hot functions that want to touch the bytes. If you have a lot of algorithms that are sped up by knowing they have ascii, you could go with two bits: “is small” and “isascii”, where isascii is set in both the small string and normal string cases.

We have not yet done the investigation, so all details could change. This is my (perhaps flawed!) reasoning as to why, in general, it may be useful to have multiple small forms. Again, this will change as we learn more.

Strings are created and copied (i.e. a value-copy/retain) more often than they are read (and strings are read far more often than they are modified). The “high-order bit” for performance is whether a string is small or not, as being small avoids an allocation and, just as importantly, skips managing the memory with ARC. Beyond that, we’re debating smaller performance deltas, which are still important. This reasoning also influences where we choose to stick our resilience barriers, which can be relaxed in the future.

One of the nice things about having a small representation for each of our backing storage kinds (ASCII/UTF-8 vs UTF-16) is that it would give us a very fast way of conservatively checking whether we can form a small string. E.g., if we’re forming a Substring over a portion of a UTF-16-backed string, we can check whether the range holds 7 or fewer code units to avoid the ARC. I don’t know if it would be worth the attempt to detect whether it would fit in 15 transcoded UTF-8 code units vs bumping the ref count. Avoiding the extra packing attempt may help branch mis-predicts, though I’m *way* out of my depth when it comes to reasoning about mis-predicts as they emerge in the wild. There is a “temporal locality of Unicode-ness” in string processing, though.

As for what a 7 code unit UTF-16 small form could hold that couldn’t fit in 15 code units of UTF-8, the biggest cases are latter-BMP scalars (which take 3 bytes each in UTF-8), of which a small UTF-8 string could only store 5. This may not be a big deal.

As to a 5-bit encoding, again, all details pending more experimentation. Its importance may be diminished by more efficient, low-level String construction interfaces. The difference between 15 and 24 characters could be big, considering that formatted addresses and Doubles fit in that range. We’ll see.

Finally, what tradeoffs do you see between a 1-word vs 2-word string? Are we really destined to have 2-words? That’s still much better than the 3 words we have now, but for some workloads it is a significant bloat.

<repeat disclaimer about final details being down to real data>. Some arguments in favor of 2-word, presented roughly in order of impact:

1. This allows the String type to accommodate llvm::StringRef-style usages. This is pretty broad usage: “mmap a file and treat its contents as a String”, “store all my contents in an llvm::BumpPtr which outlives uses”, un-owned slices, etc. One word String would greatly limit this to only whole-string nul-terminated cases.

2. Two-word String fits more small strings. Exactly where along the diminishing-returns curve 7 vs 15 UTF-8 code units lie is dependent on the data set. One example is NSString, which (according to reasoning at https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html\) considered it important enough to have 6- and 5-bit reduced ASCII character sets to squeeze strings of up to 11 characters into a word. 15 code unit small strings would be a super-set of tagged NSStrings, meaning we could bridge them eagerly in-line, while 7 code unit small strings would be a subset (and also a strong argument against eagerly bridging them).

If you have access to any interesting data sets and can report back some statistics, that would be immensely helpful!

3. More bits available to reserve for future-proofing, etc., though many of these could be stored in the header.

4. The second word can cache useful information from large strings. `endIndex` is a very frequently requested computed property and it could be stored directly in-line rather than loaded from memory (though perhaps a load happens anyways in a subsequent read of the string). Alternatively, we could store the grapheme count or some other piece of information that we’d otherwise have to recompute. More experimentation needed here.

5. (vague and hand-wavy) Two words fit into a nice groove that three words don’t: 2 words is a rule-of-thumb size for very small buffers. It’s a common heap alignment, stack alignment, vector width, double-word-load width, etc. 1-word Strings may be under-utilizing available resources; that is, the second word will often be there for use anyways. The main case where this is not true, and where 1-word shines, is aggregates of String.

I’m seeking some kind of ruthlessly-pragmatic balance here, and I think 2 words is currently winning. Again, up to further investigation and debate. FWIW, I believe Rust’s String is 3 words, i.e. it’s the std::vector-style pointer+length+capacity, but I haven’t looked into their String vs OsString vs OsStr vs … model.

### Unmanaged Strings

Such unmanaged strings can be used when the lifetime of the storage is known to outlive all uses.

Just like StringRef! +1, this concept can be a huge performance win… but can also be a huge source of UB if used wrong.. :-(

When it comes to naming, I’m more prone to argue for “UnsafeString” rather than “UnmanagedString”, but that’s a later debate for SE ;-)

As such, they don’t need to participate in reference counting. In the future, perhaps with ownership annotations or sophisticated static analysis, Swift could convert managed strings into unmanaged ones as an optimization when it knows the contents would not escape. Similarly in the future, a String constructed from a Substring that is known to not outlive the Substring’s owner can use an unmanaged form to skip copying the contents. This would benefit String’s status as the recommended common-currency type for API design.

This could also have implications for StaticString.

Yup!

Some kind of unmanaged string construct is an often-requested feature for highly performance-sensitive code. We haven’t thought too deeply about how best to surface this as API and it can be sliced and diced many ways. Some considerations:

Other questions/considerations:
- here and now, could we get the vast majority of this benefit by having the ABI pass string arguments as +0 and guarantee their lifetime across calls? What is the tradeoff of that?

Michael Gottesman (+CC) is doing this sort of investigation right now, evaluating the tradeoffs of more wide-spread use of the +0 convention. My current understanding is that even with +0, many retain/release pairs cannot be eliminated without more knowledge of the callee, but I don’t know the reasons why. Hopefully he can elaborate.

- does this concept even matter if/when we can write an argument as “borrowed String”? I suppose this is a bit more general in that it should be usable as properties and other things that the (initial) ownership design isn’t scoped to support since it doesn’t have lifetimes.

Right, it’s a little more general.

- array has exactly the same sort of concern, why is this a string-specific problem?

I think it’s interesting as slices may emerge more often in practice on strings than arrays, e.g. for presenting parsing results. Also, applications that care about this level of performance may custom-allocate all their string data contiguously (e.g. llvm::BumpPtr). I believe that projects such as llbuild have done this, forcing themselves into an API schism between String and UnsafeBufferPointer<UInt8>, often duplicating functionality. In theory all of this could also apply to array, but it seems to happen more for strings.

- how does this work with mutation? Wouldn’t we really have to have something like MutableArrayRef since it’s talking about the mutability of the elements, not something that var/let can support?

Good point. If this is more of a “UnsafeString”-like concept, there’s analogy with Unsafe[Mutable]BufferPointer.

When I think about this, it seems like a “non-owning *slice*” of some sort. If you went that direction, then you wouldn’t have to build it into the String currency type, just like it isn’t in LLVM.

Yes, though you would lose the ability to give them to APIs expecting a String when you know it’s safe to do so. E.g., when the contents are effectively immortal relative to the API call and effects. Careful use of “unsafe” or “unmanaged” in type names and argument labels needed here.

### Performance Flags

Nice.

### Miscellaneous

There are many other tweaks and improvements possible under the new ABI prior to stabilization, such as:

* Guaranteed nul-termination for String’s storage classes for faster C bridging.

This is probably best as another perf flag.

I was thinking that if we always zero-fill our storage classes and reserve one extra character, then the flag would be redundant with the discriminator bits. However, that’s assuming the string doesn’t store a nul character in the middle. I agree this might need a separate perf flag.

* Unification of String and Substring’s various Views.
* Some form of opaque string, such as an existential, which can be used as an extension point.
* More compact/performant indices.

What is the current theory on eager vs lazy bridging with Objective-C? Can we get to an eager bridging model? That alone would dramatically simplify and speed up every operation on a Swift string.

I don’t think that will be feasible in general, although perhaps it could happen for most common occurrences.

## String Ergonomics

I need to run now, but I’ll read and comment on this section when I have a chance.

That said, just today I had to write the code below and the ‘charToByte’ part of it is absolutely tragic. Are there any thoughts about how to improve character literal processing?

-Chris

func decodeHex(_ string: String) -> [UInt8] {
  func charToByte(_ c: String) -> UInt8 {
    return c.utf8.first! // swift makes char processing grotty. :-(
  }

  func hexToInt(_ c : UInt8) -> UInt8 {
    switch c {
    case charToByte("0")...charToByte("9"): return c - charToByte("0")
    case charToByte("a")...charToByte("f"): return c - charToByte("a") + 10
    case charToByte("A")...charToByte("F"): return c - charToByte("A") + 10
    default: fatalError("invalid hexadecimal character")
    }
  }

Yeah, trying to do ASCII value ranges is currently pretty rough. Here’s what I did recently when I wanted something similar, using unicode scalar literal ranges:

extension Character {
    public var isASCII: Bool {
        guard let scalar = unicodeScalars.first, unicodeScalars.count == 1 else { return false }
        return scalar.value <= 0x7f
    }
    public var isASCIIAlphaNumeric: Bool {
        guard isASCII else { return false }
        switch unicodeScalars.first! {
        case "0"..."9": return true
        case "A"..."Z": return true
        case "a"..."z": return true
        default: return false
        }
    }
}

+1 to UnicodeScalars. I've found that in the cases where you can't guarantee/don't know whether the underlying string storage is ASCII or UTF-16, if performance is a concern, accessing the scalars gives you the lowest-cost conversions (ASCII -> Scalar is always a simple promotion, and UTF-16 -> Scalar is a simple promotion in all but the relatively rare cases where you have a surrogate pair). On the other hand, transcoding to UTF-8 can be more costly if all you want to do is compare the Unicode code points numerically anyway.

Using UnicodeScalars here also has the advantage of looking cleaner in the source code (like those ranges above) thanks to ExpressibleByUnicodeScalarLiteral, which UInt8 doesn't have.

For a more general solution, I think a `var numericValue: Int? { get }` on Character would make sense. Unicode defines (at least one) semantics for this and ICU provides this functionality already.

I threw together a small library recently that adds extensions to UnicodeScalar for most of the properties exposed by ICU and I would *love* to see these become first-class: https://github.com/allevato/icu-swift/tree/master/Sources/ICU. In other words I'm happy to turn bits of this into proposals and refactor into implementation PRs :)

That looks awesome!

I think a few general APIs for querying categories/properties would satisfy most of the expert-level needs, and it would be highly valuable to provide frequently used (and easily explained) properties. For the stdlib, these may not be as exhaustive as your package, balancing usefulness and discoverability against API bloat. There would always be a place for an exhaustive/convenience package for experts and enthusiasts.

I forgot to say: Yes, please make a pitch and drive a proposal! :-)

···

On Jan 11, 2018, at 2:17 PM, Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:

On Jan 11, 2018, at 2:06 PM, Tony Allevato <tony.allevato@gmail.com> wrote:
On Thu, Jan 11, 2018 at 12:32 PM Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:

On Jan 10, 2018, at 9:29 PM, Chris Lattner <clattner@nondot.org> wrote:
On Jan 10, 2018, at 11:55 AM, Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:

  var result: [UInt8] = []

  assert(string.count & 1 == 0, "must get a pair of hexadecimal characters")
  var it = string.utf8.makeIterator()
  while let byte1 = it.next(),
        let byte2 = it.next() {
    result.append((hexToInt(byte1) << 4) | hexToInt(byte2))
  }

  return result
}

print(decodeHex("01FF"))
print(decodeHex("8001"))
print(decodeHex("80a1bcd3"))


For a more general solution, I think a `var numericValue: Int? { get }` on Character would make sense. Unicode defines (at least one) semantics for this and ICU provides this functionality already.

Minor style point – this should be a failable init on Int rather than a computed property on Character

i.e. Int.init?(_: Character), matching Int.init?(_: String, radix: Int = 10), only it doesn’t need the radix arg because it’s only a single character.

`Int.init?` is probably better, yes. If you wanted to take that to its logical conclusion, that would include `Double.init?` for mathematical constants and some `Rational.init` for fraction graphemes. These are pretty far off the deep end ;-)

Minor logic point – radix isn't important, but that's not because it’s only a single character but rather because of how Unicode(/ICU) defines character properties. Numbers can be much higher than 9, e.g. 万 is ten-thousand. Even worse is cuneiform, which is sexagesimal (and of course cuneiform is properly encoded in Unicode!). Luckily, Unicode’s numeric values are always presented in base-10, AFAICT.
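(Editorial aside: a tiny sketch of the semantics being described, not an implementation. A real numericValue would consult the Unicode character database, e.g. via ICU; numericValueSketch and its hard-coded cases below are illustrative only.)

// Illustrative only: shows the intended semantics of a numericValue-style property.
// A real implementation would query the Unicode character database rather than hard-code cases.
extension Character {
    var numericValueSketch: Double? {
        guard unicodeScalars.count == 1, let scalar = unicodeScalars.first else { return nil }
        switch scalar {
        case "0"..."9": return Double(scalar.value - 0x30)   // ASCII digits
        case "½":       return 0.5                           // VULGAR FRACTION ONE HALF
        case "万":       return 10_000                        // CJK numeral: ten thousand
        default:        return nil                           // e.g. "a" has no numeric value
        }
    }
}

// ("7" as Character).numericValueSketch == 7.0
// ("万" as Character).numericValueSketch == 10_000
// ("a" as Character).numericValueSketch == nil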

···

On Jan 11, 2018, at 4:20 PM, Ben Cohen <ben_cohen@apple.com> wrote:

On Jan 11, 2018, at 12:32 PM, Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:

If implemented inside the std lib, it can still access character’s internals, which is a reasonable thing to do for something so performance-sensitive.

We are planning to reevaluate the size of the inline buffer based on experimental performance data, but we can’t do that in a useful way until the size of String has been settled.

···

On Jan 11, 2018, at 9:46 PM, Chris Lattner via swift-dev <swift-dev@swift.org> wrote:

Finally, what tradeoffs do you see between a 1-word vs 2-word string? Are we really destined to have 2-words? That’s still much better than the 3 words we have now, but for some workloads it is a significant bloat.

<repeat disclaimer about final details being down to real data>. Some arguments in favor of 2-word, presented roughly in order of impact:

Understood. I don’t have a strong opinion on 1 vs 2 words, either are dramatically better than 3 :-). I’m glad you’re carefully evaluating the tradeoff.

1. This allows the String type to accommodate llvm::StringRef-style usages. This is pretty broad usage: “mmap a file and treat its contents as a String”, “store all my contents in an llvm::BumpPtr which outlives uses”, un-owned slices, etc. One word String would greatly limit this to only whole-string nul-terminated cases.

Yes, StringRef style algorithms are a big deal, as I mentioned in my previous email, but it is also unclear if this will really be a win when shoehorned into String. The benefit of StringRef is that it is a completely trivial type (both in the SIL sense but also in the implementation sense) and all the primitive ops get inlined. Given the “all things to all people” design of String, I’m very much afraid that trying to shoehorn this into the String currency type will fail to provide significant wins and thus lead to having a separate StringRef style type anyway. Providing a StringRef style projection type that is trivial (in the Swift sense) that knows in its static type that it never owns memory seems like the best path.

By point of comparison, C++ has std::string (yes, sure, with lots of issues) but they still introduced StringRef nee std::string_view instead of wedging it in.

2. Two-word String fits more small strings. Exactly where along the diminishing-returns curve 7 vs 15 UTF-8 code units lie is dependent on the data set. One example is NSString, which (according to reasoning at https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html\) considered it important enough to have 6- and 5-bit reduced ASCII character sets to squeeze strings of up to 11 characters into a word. 15 code unit small strings would be a super-set of tagged NSStrings, meaning we could bridge them eagerly in-line, while 7 code unit small strings would be a subset (and also a strong argument against eagerly bridging them).

Agreed, this is a big deal.

If you have access to any interesting data sets and can report back some statistics, that would be immensely helpful!

Sadly, I don’t. I’m only an opinionated hobbyist in this domain, one who has coded a lot of string processing over the years and understands at least some of the tradeoffs.

3. More bits available to reserve for future-proofing, etc., though many of these could be stored in the header.

4. The second word can cache useful information from large strings. `endIndex` is a very frequently requested computed property and it could be stored directly in-line rather than loaded from memory (though perhaps a load happens anyways in a subsequent read of the string). Alternatively, we could store the grapheme count or some other piece of information that we’d otherwise have to recompute. More experimentation needed here.

This seems weakly motivated: large strings can store end index in the heap allocation.

5. (vague and hand-wavy) Two words fit into a nice groove that three words don’t: 2 words is a rule-of-thumb size for very small buffers. It’s a common heap alignment, stack alignment, vector width, double-word-load width, etc. 1-word Strings may be under-utilizing available resources; that is, the second word will often be there for use anyways. The main case where this is not true, and where 1-word shines, is aggregates of String.

What is the expected existential inline buffer size going to wind up being? We sized it to 3 words specifically to fit string and array. It would be great to shrink that to 2 or 1 words.

(Replying out-of-order)

Sadly, I don’t. I’m only an opinionated hobbyist in this domain, one who has coded a lot of string processing over the years and understands at least some of the tradeoffs.

Wouldn’t it be amazing to be able to write a high performance lexer with all the awesomeness of Swift and without String getting in your way? :-)

^ String should NOT be an Any!

<snip>

Bridging is either “always eager” or “possibly lazy”; there is nothing useful in between. Lazy bridging means that every operation, down to the lowly ASCII parsing operations, needs to check for the slow path. I understand that this may not be possible to eliminate on Apple platforms, but it would be great to not add the existential generalization, since that would affect other platforms that don’t have the legacy API concern.

<snip>

Yes, StringRef style algorithms are a big deal, as I mentioned in my previous email, but it is also unclear if this will really be a win when shoehorned into String. The benefit of StringRef is that it is a completely trivial type (both in the SIL sense but also in the implementation sense) and all the primitive ops get inlined. Given the “all things to all people” design of String, I’m very much afraid that trying to shoehorn this into the String currency type will fail to provide significant wins and thus lead to having a separate StringRef style type anyway. Providing a StringRef style projection type that is trivial (in the Swift sense) that knows in its static type that it never owns memory seems like the best path.

By point of comparison, C++ has std::string (yes, sure, with lots of issues) but they still introduced StringRef nee std::string_view instead of wedging it in.

I left the “UnmanagedString” section open for debate on SE, but for the purposes of talking about ABI design (i.e. what is possible to expose in the future), let’s just say we’ll eventually have an `UnsafeString<CodeUnit>`, which is a pointer, a length, and packed performance flags, and which provides both a random-access view of its code units and full-Unicode operations.
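(Editorial aside: purely as a sketch of the shape being assumed, such a type might look roughly like the following. The name, fields, and flag are placeholders, not a proposed API.)

// Placeholder sketch only: a non-owning view of code units plus a packed performance flag.
// A real design would carry more flags (isCanonical, etc.) and layer grapheme-aware
// operations on top of the raw code-unit view.
struct UnsafeStringSketch<CodeUnit: UnsignedInteger>: RandomAccessCollection {
    var baseAddress: UnsafePointer<CodeUnit>
    var count: Int
    var isKnownASCII: Bool                    // example of a packed performance flag

    var startIndex: Int { return 0 }
    var endIndex: Int { return count }
    subscript(position: Int) -> CodeUnit {
        return baseAddress[position]          // the caller guarantees the memory outlives all uses
    }
}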

Please bear with me as I try to reason through what String should be incorporating the current reality of source compatibility. If you have any design insight or historical context to fill in, please do so!

In one extreme corner of this multi-dimensional design space we have a proliferation of string types such as UTF8String, UTF16String, and even combinations of performance flags such as TrivialUTF8String, CanonicalSingleScalarGraphemeSmallString, etc. This proliferation encodes all of these would-be dynamic branches in the type, which is great for performance. But it hits APIs hard: APIs either accept total chaos by taking concrete types, or try to generalize/erase over these myriad types.

On the other extreme end is AnyString, an existential that can be backed by anything that can vend code units in some fashion. The advantage of such a type is API unification: no matter what form someone has, they can always call your API without copying the underlying data. The downside is that everyone’s in the same slow boat.

Somewhere between these two extremes will be String. The question is, what details does String treat as dynamic properties? Source compatibility dictates that String abstracts away the underlying encoding, branching inside every API that performs a read of its code units. There will always be performance-critical use cases that wish to avoid these branches, i.e. branch at the top rather than throughout. A more appropriate type is needed for this, such as UTF16String, UTF8String, and UnsafeString<CodeUnit>. Treating encoding as a dynamic property is great for APIs, as it avoids an encoding schism.

If we do small string optimizations for the String type, then every String could either be small (a value) or large (a reference). Every use of large strings will be slightly penalized. To a very small extent, even performance flags slightly penalize strings that don’t set them. These are worth doing nonetheless, as the benefit is so large when they apply and the penalty is paltry relative to the cost of the operations on this slow path.

(...continued after quote)

Right. It is actually a bit more complicated than that though. Consider the Swift 4 generation of strings and arrays: because of lazy bridging each primitive operation includes the dynamic branching sequence. Beyond the dynamic performance impact of this, we end up having a really unfortunate set of tradeoffs to choose between:

1. We can leave the operations outlined in the standard library. Particularly for Array which benefits from being specialized on its element type, this is a huge perf loss.

2. We can inline the operation, but we end up inlining both sides of the operation, causing massive code bloat.

3. We can partially inline just the presumed fastpath, and leave the slow path out of line.

The problem with this when you bring it to string is that the “fastpath” would naturally be the small string case. If you make *it* have many levels of branching underneath it, not only do you slow down the fast case, but you also cause intense code bloat if you want to inline core operations into user code.

<snip>

The “high-order bit” for performance is whether a string is small or not, as being small avoids an allocation and, just as importantly, skips managing the memory with ARC. Beyond that, we’re debating smaller performance deltas, which are still important. This reasoning also influences where we choose to stick our resilience barriers, which can be relaxed in the future.

You’re making an argument about aggregate use cases on average, but it is unwise to ignore the common performance critical cases that do actually want to look at the bytes.

That is a very good point, and it brings up some of the tradeoffs we need to balance in a data-driven fashion. This reminds me of the classic throughput vs latency view of performance. We have many, many strings on our systems and anything that reduces overall resource usage is extremely valuable. On the other hand, some operations need to be as fast as possible and shouldn’t be heavily penalized for some aggregate win.

String with small string optimizations have “value” forms and “reference” forms, from a runtime and ABI perspective. Evaluating adding more value forms involves answering these questions:

1. What is the potential gain for accommodating this additional value form? This is probably a “throughput” gain.
2. What is the cost to the fast path, and to the slow path, of accommodating this additional value form? That is, how does this hurt “latency”?

Let’s say “fast path” means the small UTF-8 form, and “slow path” is the out-of-line encoding-abstracted form. The cost to the fast path could be reduced if we recognize that one particular value form (e.g. small UTF-8 string) is worth checking first and all other value forms can be handled by a call to some non-inlined function. For the fast path, it’s the difference between checking if the top bit is 1 vs the top n bits being 1, which is a couple extra instructions (on x86_64 at least).

The cost to the slow path is checking and branching for other small forms. This may be significant or insignificant, depending on the operation. E.g. meaningless compared to general grapheme breaking, but could be significant for `endIndex`.

As to String being able to accommodate “Any”-like usage, it would not impact the fast path (not a value form). It would have a similar impact on the slow path as multiple value forms, and would need similar justification.
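(Editorial aside: a tiny illustration of the fast-path check being weighed above; the bit positions are made up for the example.)

// Made-up bit assignments, purely to illustrate the cost difference discussed above.
let isSmallBit: UInt64    = 0x8000_0000_0000_0000   // top bit: the string is in some small form
let smallUTF8Mask: UInt64 = 0xF000_0000_0000_0000   // top n bits: specifically the small UTF-8 form

@inline(__always) func isSmall(_ firstWord: UInt64) -> Bool {
    return firstWord & isSmallBit != 0                  // a single test of the top bit
}

@inline(__always) func isSmallUTF8(_ firstWord: UInt64) -> Bool {
    return firstWord & smallUTF8Mask == smallUTF8Mask   // a couple of extra instructions
}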

Consider the HTTP header parsing challenges the Swift on Server workgroup faced that led them to use a C implementation, for example.

That’s very unfortunate. Highly-performance-sensitive use cases have largely been neglected by String, and it’s very important to consider their needs in the ABI. Do you know of a retrospective or post-mortem of the attempt?

HTTP header parsing probably wouldn’t want to use String, which dynamically branches on reads to accommodate both UTF-8 and UTF-16. Similarly, the headers wouldn’t be small, so no need for that extra branch around inlined code there either. A “UTF8String" type would be more appropriate. They further might be known-ASCII, which would imply trivially-encoded-and-canonical UTF8String (maybe even single-scalar-grapheme if they’re known not to have Windows-style newlines), so no need to check-and-branch—just execute the fast path. For parsed portions, they probably want UnsafeString<UInt8>, or perhaps some variant that encoded perf flags too.

I think the future is bright if we get the right ABI.
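(Editorial aside: a hedged sketch of the kind of allocation-free scan such code wants to write, using UnsafeBufferPointer<UInt8> as a stand-in for the hypothetical UTF8String/UnsafeString types above. splitHeaderField is an illustrative name, not an API.)

// Editorial sketch: split a "Name: value" header line at the first colon,
// operating directly on UTF-8 bytes with no allocation.
func splitHeaderField(
    _ line: UnsafeBufferPointer<UInt8>
) -> (name: UnsafeBufferPointer<UInt8>, value: UnsafeBufferPointer<UInt8>)? {
    guard let colon = line.firstIndex(of: 0x3A) else { return nil }   // ':'
    var valueStart = colon + 1
    while valueStart < line.count, line[valueStart] == 0x20 {          // skip leading spaces
        valueStart += 1
    }
    return (UnsafeBufferPointer(rebasing: line[..<colon]),
            UnsafeBufferPointer(rebasing: line[valueStart...]))
}

// Usage sketch:
let headerBytes = Array("Content-Length: 42".utf8)
headerBytes.withUnsafeBufferPointer { buffer in
    if let field = splitHeaderField(buffer) {
        print(String(decoding: field.name, as: UTF8.self),
              String(decoding: field.value, as: UTF8.self))            // Content-Length 42
    }
}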

Given the ability to hold 15 digits of ascii, I don’t see why it would be worthwhile to worry about a 5-bit representation for digits. String should not be an Any!

I still don’t understand why this would even be considered, do you have any more rationale to explain that?

<snip>

As to a 5-bit encoding, again, all details pending more experimentation. Its importance may be diminished by more efficient, low-level String construction interfaces. The difference between 15 and 24 characters could be big, considering that formatted addresses and Doubles fit in that range. We’ll see.

I still can’t imagine any workload that would care about this, and cannot see why punishing the normal cases would be worth any cost. Can you?

I do not at this time have detailed rationale. FWIW, my plan is to roll out the small UTF-8 representation first and only add the others if there’s sufficient data and justification (i.e. cost/benefit analysis). I’m especially interested in whether more efficient ways to construct and interpolate Strings would address the perceived need for this form.

Final details are still in exploration. If the construction and interpretation of small bit patterns can remain behind a resilience barrier, new forms could be added in the future. However, the performance impact of this would need to be carefully evaluated.

I’d love to see performance numbers on this. Have you considered a design where you have exactly one small string representation: a sequence of 15 UTF8 bytes? This holds your 15 bytes of ascii, probably more non-ascii characters on average than 7 UTF16 codepoints, and only needs one discriminator branch on the entry point of hot functions that want to touch the bytes. If you have a lot of algorithms that are sped up by knowing they have ascii, you could go with two bits: “is small” and “isascii”, where isascii is set in both the small string and normal string cases.

We have not yet done the investigation, so all details could change. This is my (perhaps flawed!) reasoning as to why, in general, it may be useful to have multiple small forms. Again, this will change as we learn more.

Strings are created and copied (i.e. a value-copy/retain) more often than they are read (and strings are read far more often than they are modified).

If you believe it is highly biased towards this, then this strongly argues for a single word representation...

… depending on how far along the diminishing returns curve 7 vs 15 UTF-8 code units is. The killer feature of single-word String is that e.g. Array<String> is denser and, since growth happens via move/take operations, grow operations are also quicker. However, an Array copy triggered by a write (copy-on-write) is subject to the tradeoff of small vs large. For Strings as arguments / return values, the win for single-word String is conditional on how many other arguments / return values there are and on our calling convention (which IIRC has devoted many registers to arguments / return values). There’s also the existential inline buffer size, which I think is worth spinning off into a separate discussion.

4. The second word can cache useful information from large strings. `endIndex` is a very frequently requested computed property and it could be stored directly in-line rather than loaded from memory (though perhaps a load happens anyways in a subsequent read of the string). Alternatively, we could store the grapheme count or some other piece of information that we’d otherwise have to recompute. More experimentation needed here.

This seems weakly motivated: large strings can store end index in the heap allocation.

Right. I was hoping that large strings can use the extra bits to recover some of the (albeit small) regression from accommodating a small form, and endIndex has a higher relative hit. But, it would probably be better to use them for something else and not try to “anchor” ourselves to large-string micro benchmarks.

5. (vague and hand-wavy) Two words fit into a nice groove that three words don’t: 2 words is a rule-of-thumb size for very small buffers. It’s a common heap alignment, stack alignment, vector width, double-word-load width, etc. 1-word Strings may be under-utilizing available resources; that is, the second word will often be there for use anyways. The main case where this is not true, and where 1-word shines, is aggregates of String.

What is the expected existential inline buffer size going to wind up being? We sized it to 3 words specifically to fit string and array. It would be great to shrink that to 2 or 1 words.

(Bob replied to this).

Another angle is that Substring is currently String + a 2-word range. We think it should be shrunk to String + a 1-word index-mapping offset. 2-word String would let Substring fit in the existing inline buffer. The question is whether the inline buffer should shrink with String, or if there are cases such as Substring that are still important.

- array has exactly the same sort of concern, why is this a string-specific problem?

I think it’s interesting as slices may emerge more often in practice on strings than arrays, e.g. for presenting parsing results. Also, applications that care about this level of performance may custom-allocate all their string data contiguously (e.g. llvm::BumpPtr). I believe that projects such as llbuild have done this, forcing themselves into an API schism between String and UnsafeBufferPointer<UInt8>, often duplicating functionality. In theory all of this could also apply to array, but it seems to happen more for strings.

I see String and Array as directly analogous in the space. The *only* difference IMO is when you get to algorithms that interpret the bytes because of frickin’ unicode.

<snip>

Yes, I think that your point about Unsafe is important, which is even more reason not to wedge it into String. I really do understand the tradeoff of how you can’t pass a “Ref” to something that takes a String, but we already extensively debated that topic and accepted it w.r.t. Slice types. This seems directly analogous.

The analogy between String vs Substring and Array vs Slice is interesting and makes sense. I’ll have to think about this a bit more, and whether String vs Substring might differ in practice.

···

* Unification of String and Substring’s various Views.
* Some form of opaque string, such as an existential, which can be used as an extension point.
* More compact/performant indices.

What is the current theory on eager vs lazy bridging with Objective-C? Can we get to an eager bridging model? That alone would dramatically simplify and speed up every operation on a Swift string.

I don’t think that will be feasible in general, although perhaps it could happen for most common occurrences.

  func hexToInt(_ c : UInt8) -> UInt8 {
    switch c {
    case charToByte("0")...charToByte("9"): return c - charToByte("0")
    case charToByte("a")...charToByte("f"): return c - charToByte("a") + 10
    case charToByte("A")...charToByte("F"): return c - charToByte("A") + 10
    default: fatalError("invalid hexadecimal character")
    }
  }

Yeah, trying to do ASCII value ranges is currently pretty rough. Here’s what I did recently when I wanted something similar, using unicode scalar literal ranges:

We resisted adding single-quoted strings for characters early on because we felt that it was a marginal case and we might want to use them for additional quoting characters (as in scripting languages like Perl, which allow either single or double quotes) or for multi-line strings. However, now that multiline strings are settled and now that it is clear that we’re not going to add multiple redundant ways to quote strings, I think it is worth considering whether we should introduce ExpressibleByCharacterLiteral types and whether UInt8 should be one (along with Character of course) and tie it to single quotes.

-Chris
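(Editorial aside: not a proposal, and retroactive conformances like this are not recommended in real code, but as an illustration of the gap, today one can get most of the ergonomic benefit with double quotes via ExpressibleByUnicodeScalarLiteral.)

// Illustration only: retroactively conforming UInt8 to ExpressibleByUnicodeScalarLiteral
// makes `let b: UInt8 = "a"` compile, approximating the single-quoted character
// literals discussed above. Traps if the scalar is outside 0...255.
extension UInt8: ExpressibleByUnicodeScalarLiteral {
    public init(unicodeScalarLiteral value: Unicode.Scalar) {
        self = UInt8(value.value)
    }
}

let lowercaseA: UInt8 = "a"   // 97
let colonByte: UInt8 = ":"    // 58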

For a more general solution, I think a `var numericValue: Int? { get }` on Character would make sense. Unicode defines (at least one) semantics for this and ICU provides this functionality already.

Minor style point – this should be a failable init on Int rather than a computed property on Character

i.e. Int.init?(_: Character), matching Int.init?(_: String, radix: Int = 10), only it doesn’t need the radix arg because it’s only a single character.

`Int.init?` is probably better, yes. If you wanted to take that to its logical conclusion, that would include `Double.init?` for mathematical constants and some `Rational.init` for fraction graphemes. These are pretty far off the deep end ;-)

Minor logic point – radix isn't important, but that's not because it’s only a single character but rather because of how Unicode(/ICU) defines character properties. Numbers can be much higher than 9, e.g. 万 is ten-thousand. Even worse is cuneiform, which is sexagesimal (and of course cuneiform is properly encoded in Unicode!). Luckily, Unicode’s numeric values are always presented in base-10, AFAICT.

Seems like two different features got tangled here, both of which are useful…

On the one hand, Chris’ case of “I’ve got a Character and I want to convert ASCII 0-9a-f hex into an Int and it’s incredibly painful in Swift 4”. This is a pretty common everyday need and calls for a failable initializer on Int, similar to the one that takes a String, that is very fast and only covers those characters. Thinking about it, you probably do want an optional radix argument, because you want to explicitly control whether “a” is nil or 10.

A separate, different feature is exposing Unicode numeric properties of characters, including things like 5️⃣ = 5.0, ½ = 0.5, etc., but where “a” is nil. This possibly works better as a property like numericValue: Double? than as an init on Double.
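(Editorial aside: a hedged sketch of the first feature, the fast ASCII-only failable initializer, written with an explicit argument label to mark it as illustrative rather than the unlabeled form proposed above.)

// Illustrative only: a fast ASCII digit/letter conversion honoring radix, in the spirit
// of the proposed Int.init?(_: Character, radix:). Not proposed API as written.
extension Int {
    init?(asciiDigit character: Character, radix: Int = 10) {
        let scalars = character.unicodeScalars
        guard scalars.count == 1, let scalar = scalars.first else { return nil }
        let digit: Int
        switch scalar {
        case "0"..."9": digit = Int(scalar.value) - 0x30        // '0' is U+0030
        case "a"..."z": digit = Int(scalar.value) - 0x61 + 10   // 'a' is U+0061
        case "A"..."Z": digit = Int(scalar.value) - 0x41 + 10   // 'A' is U+0041
        default: return nil
        }
        guard digit < radix else { return nil }                 // "a" is nil in radix 10
        self = digit
    }
}

// Int(asciiDigit: "7")            == 7
// Int(asciiDigit: "f", radix: 16) == 15
// Int(asciiDigit: "a")            == nil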

···

On Jan 11, 2018, at 17:34, Michael Ilseman <milseman@apple.com> wrote:

On Jan 11, 2018, at 4:20 PM, Ben Cohen <ben_cohen@apple.com> wrote:

On Jan 11, 2018, at 12:32 PM, Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:

If implemented inside the std lib, it can still access character’s internals, which is a reasonable thing to do for something so performance-sensitive.

@Michael_Ilseman Now that Discourse is here, perhaps you could split the various parts of this discussion into threads so we can discuss them separately? That would make it much easier to follow each separate thread.
