[pitch] Substring performance affordances


(Ben Cohen) #1

Hi swift-evolution,

Below is a short pitch for some performance improvements to be made to String to accommodate Substrings.

As outlined in SE-0163, the more general question of implicit conversion from Substring to String was deferred pending feedback on the initial implementation. To date, the feedback we’ve received hasn’t suggested that such an implicit conversion is necessary – that migrations from 3.2 to 4.0 haven’t led to an extensive need to perform Substring->String conversion. Any further input, either on or off list, about this particular aspect of migration would be very gratefully received.

# Substring performance affordances

* Proposal: [SE-NNNN](NNNN-substring-affordances.md)
* Authors: [Ben Cohen](https://github.com/airspeedswift)
* Review Manager: TBD
* Status: **Awaiting review**

## Introduction

This proposal modifies a small number of methods in the standard library that
are commonly used with the `Substring` type:

- Modify the `init` on floating point and integer types, to construct them
  from `StringProtocol` rather than `String`.
- Change `join` to be an extension `where Element: StringProtocol`
- Add extensions to `Dictionary where Key == String` and `Set where Element ==
  String` to test for presence of a `Substring`.

## Motivation

Swift 4 introduced `Substring` as the slice type for `String`. Previously,
`String` had been its own slice type, but this leads to issues where string
buffers can be unexpectedly retained. This approach was adopted instead of the
alternative of having the slicing operation make a copy. A copying slicing
operation would have negative performance consequences, and would also conflict
with the requirement that `Collection` be sliceable in constant time. In cases
where an API requires a `String`, the user must construct a new `String` from a
`Substring`. This can be thought of as a "deferral" of the copy that was
avoided at the time of the slice.

There are a few places in the standard library where it is notably inefficient
to force a copy of a substring in order to use it with a string: performing
lookups in hashed containers, joining substrings, and converting substrings to
integers. In particular, these operations are likely to be used inside a loop
over a number of substrings extracted from a string. For example, suppose you
had a string of key/value pairs, where the values were integers and you wanted
to sum them by key. You would be forced to convert both the `Substring` keys
and values to `String` to do this.

## Proposed solution

Add the following to the standard library:

extension FixedWidthInteger {
 public init?<S : StringProtocol>(_ text: S, radix: Int = 10)
}

extension Float/Double/Float80 {
 public init?<S : StringProtocol>(_ text: S, radix: Int = 10)
}

extension Sequence where Element: StringProtocol {
 public func joined(separator: String = "") -> String
}

extension Dictionary where Key == String {
 public subscript(key: Substring) -> Value? { get set }
 public subscript(key: Substring, default defaultValue: @autoclosure () -> Value) -> Value { get set }
}

extension Set where Element == String {
 public func contains(_ member: Substring) -> Bool
 public func index(of member: Substring) -> Index?
 public mutating func insert(_ newMember: Substring) -> (inserted: Bool, memberAfterInsert: Element)
 public mutating func remove(_ member: Substring) -> Element?
}

These additions are deliberately narrow in scope. They are _not_ intended to
solve a general problem of being able to interchange substrings for strings (or
more generally slices for collections) generically in different APIs. See the
alternatives considered section for more on this.

## Source compatibility

No impact, these are either additive (in case of hashed containers) or
generalize an existing API to a protocol (in case of numeric
conversion/joining).

## Effect on ABI stability

The hashed container changes are additive so no impact. The switch from conrete
to generic types for the numeric conversions needs to be made before ABI
stability.

## Alternatives considered

While they have a convenience benefit as well, this is not the primary goal of
these additions, but a side-effect of helping avoid a performance problem. In
many other cases, the performance issues can be avoided via modified use e.g.
`Sequence.contains` of a `Substring` in a sequence of strings can be written as
`sequence.contains { $0 == substring }` .

These changes are limited in scope, and further additions could be considered
in the future. For example, should the `Dictionary.init(grouping:by:) where Key
== String` operation be enhanced to similarly take a sequence of substrings?
There is a long tail of these cases, and the need to keep unnecessary overloads
to a minimum, avoiding typechecker work and code bloat, must be weighed against
the likelyhood that string copies will be a performance problems.

There is a more general problem of interoperating between collections and
slices. In the future, there may be other affordances for converting/comparing
them. For example, it might be desirable to require equatable collections to
have equatable slices, and to automatically provide default implementations of
`==` that efficiently compare a collection to its default slice. These
enhancements rely on features such as conditional conformance, and so may be
worth considering in later versions of Swift but are not an option currently.


(David Hart) #2

Purely additive, so +1 from me. Side note, I’m wondering how problematic these same discussions will be in third-party library code. Should authors use StringProtocol or String as often as possible?

···

On 28 Jun 2017, at 18:37, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

Hi swift-evolution,

Below is a short pitch for some performance improvements to be made to String to accommodate Substrings.

As outlined in SE-0163, the more general question of implicit conversion from Substring to String was deferred pending feedback on the initial implementation. To date, the feedback we’ve received hasn’t suggested that such an implicit conversion is necessary – that migrations from 3.2 to 4.0 haven’t led to an extensive need to perform Substring->String conversion. Any further input, either on or off list, about this particular aspect of migration would be very gratefully received.

# Substring performance affordances

* Proposal: [SE-NNNN](NNNN-substring-affordances.md)
* Authors: [Ben Cohen](https://github.com/airspeedswift)
* Review Manager: TBD
* Status: **Awaiting review**

## Introduction

This proposal modifies a small number of methods in the standard library that
are commonly used with the `Substring` type:

- Modify the `init` on floating point and integer types, to construct them
  from `StringProtocol` rather than `String`.
- Change `join` to be an extension `where Element: StringProtocol`
- Add extensions to `Dictionary where Key == String` and `Set where Element ==
  String` to test for presence of a `Substring`.

## Motivation

Swift 4 introduced `Substring` as the slice type for `String`. Previously,
`String` had been its own slice type, but this leads to issues where string
buffers can be unexpectedly retained. This approach was adopted instead of the
alternative of having the slicing operation make a copy. A copying slicing
operation would have negative performance consequences, and would also conflict
with the requirement that `Collection` be sliceable in constant time. In cases
where an API requires a `String`, the user must construct a new `String` from a
`Substring`. This can be thought of as a "deferral" of the copy that was
avoided at the time of the slice.

There are a few places in the standard library where it is notably inefficient
to force a copy of a substring in order to use it with a string: performing
lookups in hashed containers, joining substrings, and converting substrings to
integers. In particular, these operations are likely to be used inside a loop
over a number of substrings extracted from a string. For example, suppose you
had a string of key/value pairs, where the values were integers and you wanted
to sum them by key. You would be forced to convert both the `Substring` keys
and values to `String` to do this.

## Proposed solution

Add the following to the standard library:

extension FixedWidthInteger {
 public init?<S : StringProtocol>(_ text: S, radix: Int = 10)
}

extension Float/Double/Float80 {
 public init?<S : StringProtocol>(_ text: S, radix: Int = 10)
}

extension Sequence where Element: StringProtocol {
 public func joined(separator: String = "") -> String
}

extension Dictionary where Key == String {
 public subscript(key: Substring) -> Value? { get set }
 public subscript(key: Substring, default defaultValue: @autoclosure () -> Value) -> Value { get set }
}

extension Set where Element == String {
 public func contains(_ member: Substring) -> Bool
 public func index(of member: Substring) -> Index?
 public mutating func insert(_ newMember: Substring) -> (inserted: Bool, memberAfterInsert: Element)
 public mutating func remove(_ member: Substring) -> Element?
}

These additions are deliberately narrow in scope. They are _not_ intended to
solve a general problem of being able to interchange substrings for strings (or
more generally slices for collections) generically in different APIs. See the
alternatives considered section for more on this.

## Source compatibility

No impact, these are either additive (in case of hashed containers) or
generalize an existing API to a protocol (in case of numeric
conversion/joining).

## Effect on ABI stability

The hashed container changes are additive so no impact. The switch from conrete
to generic types for the numeric conversions needs to be made before ABI
stability.

## Alternatives considered

While they have a convenience benefit as well, this is not the primary goal of
these additions, but a side-effect of helping avoid a performance problem. In
many other cases, the performance issues can be avoided via modified use e.g.
`Sequence.contains` of a `Substring` in a sequence of strings can be written as
`sequence.contains { $0 == substring }` .

These changes are limited in scope, and further additions could be considered
in the future. For example, should the `Dictionary.init(grouping:by:) where Key
== String` operation be enhanced to similarly take a sequence of substrings?
There is a long tail of these cases, and the need to keep unnecessary overloads
to a minimum, avoiding typechecker work and code bloat, must be weighed against
the likelyhood that string copies will be a performance problems.

There is a more general problem of interoperating between collections and
slices. In the future, there may be other affordances for converting/comparing
them. For example, it might be desirable to require equatable collections to
have equatable slices, and to automatically provide default implementations of
`==` that efficiently compare a collection to its default slice. These
enhancements rely on features such as conditional conformance, and so may be
worth considering in later versions of Swift but are not an option currently.
_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


#3

Hi,

Just a quick question. Does this mean it's guaranteed that
Substring.hashValue is always equal to String.hashValue?

let foobar: String = "foo bar"
let sliced: Substring = foobar.suffix(3)
let bar = "bar"
bar.hashValue == sliced.hashValue // guaranteed?
···

2017-06-29 1:37 GMT+09:00 Ben Cohen via swift-evolution < swift-evolution@swift.org>:

Hi swift-evolution,

Below is a short pitch for some performance improvements to be made to
String to accommodate Substrings.

As outlined in SE-0163, the more general question of implicit conversion
from Substring to String was deferred pending feedback on the initial
implementation. To date, the feedback we’ve received hasn’t suggested that
such an implicit conversion is necessary – that migrations from 3.2 to 4.0
haven’t led to an extensive need to perform Substring->String conversion.
Any further input, either on or off list, about this particular aspect of
migration would be very gratefully received.

# Substring performance affordances

* Proposal: [SE-NNNN](NNNN-substring-affordances.md)
* Authors: [Ben Cohen](https://github.com/airspeedswift)
* Review Manager: TBD
* Status: **Awaiting review**

## Introduction

This proposal modifies a small number of methods in the standard library
that
are commonly used with the `Substring` type:

- Modify the `init` on floating point and integer types, to construct them
  from `StringProtocol` rather than `String`.
- Change `join` to be an extension `where Element: StringProtocol`
- Add extensions to `Dictionary where Key == String` and `Set where
Element ==
  String` to test for presence of a `Substring`.

## Motivation

Swift 4 introduced `Substring` as the slice type for `String`. Previously,
`String` had been its own slice type, but this leads to issues where string
buffers can be unexpectedly retained. This approach was adopted instead of
the
alternative of having the slicing operation make a copy. A copying slicing
operation would have negative performance consequences, and would also
conflict
with the requirement that `Collection` be sliceable in constant time. In
cases
where an API requires a `String`, the user must construct a new `String`
from a
`Substring`. This can be thought of as a "deferral" of the copy that was
avoided at the time of the slice.

There are a few places in the standard library where it is notably
inefficient
to force a copy of a substring in order to use it with a string: performing
lookups in hashed containers, joining substrings, and converting
substrings to
integers. In particular, these operations are likely to be used inside a
loop
over a number of substrings extracted from a string. For example, suppose
you
had a string of key/value pairs, where the values were integers and you
wanted
to sum them by key. You would be forced to convert both the `Substring`
keys
and values to `String` to do this.

## Proposed solution

Add the following to the standard library:

extension FixedWidthInteger {
 public init?<S : StringProtocol>(_ text: S, radix: Int = 10)
}

extension Float/Double/Float80 {
 public init?<S : StringProtocol>(_ text: S, radix: Int = 10)
}

extension Sequence where Element: StringProtocol {
 public func joined(separator: String = "") -> String
}

extension Dictionary where Key == String {
 public subscript(key: Substring) -> Value? { get set }
 public subscript(key: Substring, default defaultValue: @autoclosure () ->
Value) -> Value { get set }
}

extension Set where Element == String {
 public func contains(_ member: Substring) -> Bool
 public func index(of member: Substring) -> Index?
 public mutating func insert(_ newMember: Substring) -> (inserted: Bool,
memberAfterInsert: Element)
 public mutating func remove(_ member: Substring) -> Element?
}

These additions are deliberately narrow in scope. They are _not_ intended
to
solve a general problem of being able to interchange substrings for
strings (or
more generally slices for collections) generically in different APIs. See
the
alternatives considered section for more on this.

## Source compatibility

No impact, these are either additive (in case of hashed containers) or
generalize an existing API to a protocol (in case of numeric
conversion/joining).

## Effect on ABI stability

The hashed container changes are additive so no impact. The switch from
conrete
to generic types for the numeric conversions needs to be made before ABI
stability.

## Alternatives considered

While they have a convenience benefit as well, this is not the primary
goal of
these additions, but a side-effect of helping avoid a performance problem.
In
many other cases, the performance issues can be avoided via modified use
e.g.
`Sequence.contains` of a `Substring` in a sequence of strings can be
written as
`sequence.contains { $0 == substring }` .

These changes are limited in scope, and further additions could be
considered
in the future. For example, should the `Dictionary.init(grouping:by:)
where Key
== String` operation be enhanced to similarly take a sequence of
substrings?
There is a long tail of these cases, and the need to keep unnecessary
overloads
to a minimum, avoiding typechecker work and code bloat, must be weighed
against
the likelyhood that string copies will be a performance problems.

There is a more general problem of interoperating between collections and
slices. In the future, there may be other affordances for
converting/comparing
them. For example, it might be desirable to require equatable collections
to
have equatable slices, and to automatically provide default
implementations of
`==` that efficiently compare a collection to its default slice. These
enhancements rely on features such as conditional conformance, and so may
be
worth considering in later versions of Swift but are not an option
currently.

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Dave Abrahams) #4

Hi swift-evolution,

Below is a short pitch for some performance improvements to be made to String to accommodate
Substrings.

As outlined in SE-0163, the more general question of implicit conversion from Substring to String
was deferred pending feedback on the initial implementation. To date, the feedback we’ve received
hasn’t suggested that such an implicit conversion is necessary – that migrations from 3.2 to 4.0
haven’t led to an extensive need to perform Substring->String conversion. Any further input, either
on or off list, about this particular aspect of migration would be very gratefully received.

# Substring performance affordances

* Proposal: [SE-NNNN](NNNN-substring-affordances.md)
* Authors: [Ben Cohen](https://github.com/airspeedswift)
* Review Manager: TBD
* Status: **Awaiting review**

## Introduction

This proposal modifies a small number of methods in the standard library that
are commonly used with the `Substring` type:

- Modify the `init` on floating point and integer types, to construct them
  from `StringProtocol` rather than `String`.

+1. This just makes good sense.

- Change `join` to be an extension `where Element: StringProtocol`

I don't see how this is implementable unless we ignore the May 10
core team resolution that RangeReplaceableCollection conformance be
removed from StringProtocol, or we add RangeReplaceableCollection to the
constraints on Element. We had good reasons for the resolution, so I
think we'd need to do the latter.

That said, I presume this is being called "joined()" per
https://github.com/apple/swift/pull/10734... I'm concerned that we
haven't sufficiently considered the forest of joined() overloads we're
expanding. For all other sequences, joined() produces a lazy result. I
realize that joining strings to produce a string is a really important
operation, but

   String(myStrings.joined())

is not so onerous as to make it an obvious choice to add this overload
to StringProtocol, especially since the other ones will still appear on
String thanks to its Sequence conformance. String's conformance to
Collection gets confusing when these kinds of things happen. Instead,
we should be considering dropping the eager overload from String for
Swift 4. Or, if we think it's important enough to avoid the explicit
conversion to string, maybe we should be thinking about making the other
joined()s eager.

- Add extensions to `Dictionary where Key == String` and `Set where
Element == String` to test for presence of a `Substring`.

Here's why I don't think we should do this:

* As noted in your introduction, we deliberately avoided taking steps
  that would make Substrings transparently work where Strings are now
  accepted so that we'd find out whether the impedance mismatch was
  going to be an issue. We've had very little time to get information
  about String/Substring interop issues that people might face and
  doing this now would mute the signal.

* The proposal's stated motivation is to work around a performance
  problem, and it does so by adding a very specific, special-purpose
  piece of API. But the performance issues are actually more general:

  - Sets and dictionaries offer no way to look up a key and, when the
    key isn't found, use the bucket information computed by the lookup
    for a subsequent insertion.

      extension Set {
        func toggle(_ k: Element) {
          let p = s.lookup(k) // <==== for example
          if p.found {
            s.remove(at: p.index)
          }
          else {
            s.insert(k, at: p)
          }
        }
      }

  - Substring offers no way to create a corresponding String that avoids
    a copy when necessary for performance reasons:

      for w: Substring in words {
        result += transform(
          String(maintainingExcessStorageBySharing: w)) // <== for example
      }

    If transform is an API that—as we recommend by default—operates on
    String rather than Substring or StringProtocol, and is known not to
    make long-term copies of its argument, users currently have no way
    to avoid the performance cost of a copy. It's not something people
    should use every day, but when you need it, you need it.

  Together these APIs can be used to solve the Substring performance
  problem with Set<String>/Dictionary<String,V> insertion, and a whole
  raft of others besides.

    let p = s.lookup(String(maintainingExcessStorageBySharing: w))
    if !p.found { s.insert(String(w), at: p)

* The proposed API adds overloads, which harms usability, and having
  these particular overloads is IMO not clearly the right answer in the
  long term. Removing APIs later is expensive in labor, harmful to
  users, and sometimes difficult technically.

* Handling these cases transparently sets up an expectation that other
  such cases should be handled transparently as well. The proposal
  itself raises many such possibilities, and the implications seem to be
  lots of API complexity creep and things that magically work in some
  cases but not in others. Even if we refuse to extend this pattern to
  other APIs, failure to meet expectations and the resulting confusion
  is a bad user experience.

In sum, we should be quite sure of this API before we add it, and I, at
least, am not. If we *do* decide to add it, we should at the same time
establish some general principles that drive its addition so that we
have a consistent basis on which to make the same kind of decision about
other APIs.

Thanks for listening,
Dave

···

on Wed Jun 28 2017, Ben Cohen <swift-evolution@swift.org> wrote:

## Motivation

Swift 4 introduced `Substring` as the slice type for `String`. Previously,
`String` had been its own slice type, but this leads to issues where string
buffers can be unexpectedly retained. This approach was adopted instead of the
alternative of having the slicing operation make a copy. A copying slicing
operation would have negative performance consequences, and would also conflict
with the requirement that `Collection` be sliceable in constant time. In cases
where an API requires a `String`, the user must construct a new `String` from a
`Substring`. This can be thought of as a "deferral" of the copy that was
avoided at the time of the slice.

There are a few places in the standard library where it is notably inefficient
to force a copy of a substring in order to use it with a string: performing
lookups in hashed containers, joining substrings, and converting substrings to
integers. In particular, these operations are likely to be used inside a loop
over a number of substrings extracted from a string. For example, suppose you
had a string of key/value pairs, where the values were integers and you wanted
to sum them by key. You would be forced to convert both the `Substring` keys
and values to `String` to do this.

## Proposed solution

Add the following to the standard library:

extension FixedWidthInteger {
 public init?<S : StringProtocol>(_ text: S, radix: Int = 10)
}

extension Float/Double/Float80 {
 public init?<S : StringProtocol>(_ text: S, radix: Int = 10)
}

extension Sequence where Element: StringProtocol {
 public func joined(separator: String = "") -> String
}

extension Dictionary where Key == String {
 public subscript(key: Substring) -> Value? { get set }
 public subscript(key: Substring, default defaultValue: @autoclosure () -> Value) -> Value { get set
}
}

extension Set where Element == String {
 public func contains(_ member: Substring) -> Bool
 public func index(of member: Substring) -> Index?
 public mutating func insert(_ newMember: Substring) -> (inserted: Bool, memberAfterInsert: Element)
 public mutating func remove(_ member: Substring) -> Element?
}

These additions are deliberately narrow in scope. They are _not_ intended to
solve a general problem of being able to interchange substrings for strings (or
more generally slices for collections) generically in different APIs. See the
alternatives considered section for more on this.

## Source compatibility

No impact, these are either additive (in case of hashed containers) or
generalize an existing API to a protocol (in case of numeric
conversion/joining).

## Effect on ABI stability

The hashed container changes are additive so no impact. The switch from conrete
to generic types for the numeric conversions needs to be made before ABI
stability.

## Alternatives considered

While they have a convenience benefit as well, this is not the primary goal of
these additions, but a side-effect of helping avoid a performance problem. In
many other cases, the performance issues can be avoided via modified use e.g.
`Sequence.contains` of a `Substring` in a sequence of strings can be written as
`sequence.contains { $0 == substring }` .

These changes are limited in scope, and further additions could be considered
in the future. For example, should the `Dictionary.init(grouping:by:) where Key
== String` operation be enhanced to similarly take a sequence of substrings?
There is a long tail of these cases, and the need to keep unnecessary overloads
to a minimum, avoiding typechecker work and code bloat, must be weighed against
the likelyhood that string copies will be a performance problems.

There is a more general problem of interoperating between collections and
slices. In the future, there may be other affordances for converting/comparing
them. For example, it might be desirable to require equatable collections to
have equatable slices, and to automatically provide default implementations of
`==` that efficiently compare a collection to its default slice. These
enhancements rely on features such as conditional conformance, and so may be
worth considering in later versions of Swift but are not an option currently.
_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

--
-Dave


(Ben Cohen) #5

Our general advice is to stick with String. Most APIs would be simpler and clearer just using String rather than being made generic (which itself can come at a cost), and user conversion on the way in on the few occasions that's needed isn’t much of a burden.

The exception is when your API is extremely likely to be used with substrings. The most common case for this is that it’s an API for doing string processing (such as with join or the number initializers), or when it’s very likely used with substring and conversion to string would be a major potential bottleneck (such as with set/dictionary). Also, very general algorithms that would benefit from being on StringProtocol might also be generalizable further to Sequence/Collection.

···

On Jun 28, 2017, at 2:41 PM, David Hart <david@hartbit.com> wrote:

Should authors use StringProtocol or String as often as possible?


(Ben Cohen) #6

Yes.

···

On Jun 28, 2017, at 10:26 AM, rintaro ishizaki <fs.output@gmail.com> wrote:

Does this mean it's guaranteed that Substring.hashValue is always equal to String.hashValue?