[Pitch] String revision proposal #1


(Ben Cohen) #1

Hi Swift Evolution,

Below is a pitch for the first part of the String revision. This covers a number of changes that would allow the basic internals to be overhauled.

Online version here: https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md

String Revision: Collection Conformance, C Interop, Transcoding

Proposal: SE-0161 <file:///Users/ben_cohen/Documents/swift-evolution/proposals/0161-StringRevision1.md>
Authors: Ben Cohen <https://github.com/airspeedswift>, Dave Abrahams <http://github.com/dabrahams/>
Review Manager: TBD
Status: Awaiting review
Introduction

This proposal is to implement a subset of the changes from the Swift 4 String Manifesto <https://github.com/apple/swift/blob/master/docs/StringManifesto.md>.

Specifically:

- Make String conform to BidirectionalCollection
- Make String conform to RangeReplaceableCollection
- Create a Substring type for String.SubSequence
- Create a Unicode protocol to allow for generic operations over both types.
- Consolidate on a concise set of C interop methods.
- Revise the transcoding infrastructure.

Other existing aspects of String remain unchanged for the purposes of this proposal.

Motivation

This proposal follows up on a number of recommendations found in the manifesto:

Collection conformance was dropped from String in Swift 2. After reevaluation, the feeling is that the minor semantic discrepancies (mainly with RangeReplaceableCollection) are outweighed by the significant benefits of restoring these conformances. For more detail on the reasoning, see here <https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again>
While it is not a collection, the Swift 3 string does have slicing operations. String is currently serving as its own subsequence, allowing substrings to share storage with their “owner”. This can lead to memory leaks when small substrings of larger strings are stored long-term (see here <https://github.com/apple/swift/blob/master/docs/StringManifesto.md#substrings> for more detail on this problem). Introducing a separate Substring type to serve as String.SubSequence is recommended to resolve this issue, in a similar fashion to ArraySlice.

As noted in the manifesto, support for interoperation with nul-terminated C strings in Swift 3 is scattered and incoherent, with six ways to transform a C string into a String and four ways to do the inverse. These APIs should be replaced with a simpler set of methods on String.

Proposed solution

A new type, Substring, will be introduced. Similar to ArraySlice it will be documented as only for short- to medium-term storage:

Important

Long-term storage of Substring instances is discouraged. A substring holds a reference to the entire storage of a larger string, not just to the portion it presents, even after the original string’s lifetime ends. Long-term storage of a substring may therefore prolong the lifetime of elements that are no longer otherwise accessible, which can appear to be memory leakage.

Aside from minor differences, such as having a SubSequence of Self and a larger size to describe the range of the subsequence, Substring will be near-identical from a user perspective.
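To make the storage concern concrete, here is a minimal sketch written against the Substring type as it exists in current Swift toolchains. The variable names and the sample text are made up for illustration.

```swift
// A large string standing in for, say, the contents of a whole file.
let document = String(repeating: "lorem ipsum ", count: 1_000) + "needle"

// Slicing yields a Substring that refers to the original string's storage.
let slice: Substring = document.suffix(6)

// Copy into a fresh String before keeping the value around long-term, so
// that a six-character slice does not keep the large buffer alive.
let stored = String(slice)

print(stored)
```

The explicit String(slice) is the point at which the copy happens; until then the slice is only a view onto document.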

In order to be able to write extensions across both String and Substring, a new Unicode protocol to which the two types will conform will be introduced. For the purposes of this proposal, Unicode will be defined as a protocol to be used whenever you would previously extend String. It should be possible to substitute extension Unicode { ... } in Swift 4 wherever extension String { ... } was written in Swift 3, with one exception: any passing of self into an API that takes a concrete String will need to be rewritten as String(self). If Self is a String then this should effectively optimize to a no-op, whereas if Self is a Substring then this will force a copy, helping to avoid the “memory leak” problems described above.
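As a sketch of that substitution: the pitched Unicode protocol is not something one can compile against, so the example below uses StringProtocol, which is the protocol current Swift toolchains provide for both String and Substring. The functions shout and shouted are hypothetical names introduced only for this illustration.

```swift
/// Hypothetical API that accepts only a concrete String.
func shout(_ text: String) -> String {
    return text.uppercased() + "!"
}

extension StringProtocol {
    /// Hypothetical helper available on both String and Substring.
    func shouted() -> String {
        // `self` may be a Substring here, so the conversion is explicit.
        return shout(String(self))
    }
}

let greeting = "hello world"
print(greeting.shouted())            // receiver is a String
print(greeting.prefix(5).shouted())  // receiver is a Substring
```

The only line that differs from a plain extension String version is the String(self) conversion at the call to shout.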

The exact nature of the protocol – such as which methods should be protocol requirements vs. which can be implemented as protocol extensions – is considered an implementation detail and so is not covered in this proposal.

Unicode will conform to BidirectionalCollection. RangeReplaceableCollection conformance will be added directly onto the String and Substring types, as it is possible future Unicode-conforming types might not be range-replaceable (e.g. an immutable type that wraps a const char *).

The C string interop methods will be updated to those described here <https://github.com/apple/swift/blob/master/docs/StringManifesto.md#c-string-interop>: a single withCString operation and two init(cString:) constructors, one for UTF8 and one for arbitrary encodings. The primary change is to remove “non-repairing” variants of construction from nul-terminated C strings. In both of the construction APIs, any invalid encoding sequence detected will have its longest valid prefix replaced by U+FFFD, the Unicode replacement character, per the Unicode specification. This covers the common case. The replacement is done physically in the underlying storage and the validity of the result is recorded in the String’s encoding such that future accesses need not be slowed down by possible error repair separately. Construction that is aborted when encoding errors are detected can be accomplished using APIs on the encoding.
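A short sketch of the repairing behaviour and of withCString, using the String(cString:) overload for UnsafePointer<UInt8> and the withCString method available in current Swift toolchains. The byte values are chosen for illustration: 0xFF can never appear in well-formed UTF-8.

```swift
// "Hi", then an invalid byte, then the terminating NUL.
let bytes: [UInt8] = [0x48, 0x69, 0xFF, 0x00]

// The initializer reads up to the first zero byte and repairs the
// ill-formed sequence by substituting U+FFFD.
let repaired = bytes.withUnsafeBufferPointer { buffer in
    String(cString: buffer.baseAddress!)
}
print(repaired == "Hi\u{FFFD}")

// withCString exposes a temporary NUL-terminated UTF-8 buffer.
let utf8Length = "h\u{E9}llo".withCString { pointer -> Int in
    var count = 0
    while pointer[count] != 0 { count += 1 }
    return count
}
print(utf8Length)  // U+00E9 takes two UTF-8 code units
```

The pointer passed to the closure is only valid for the duration of the call, so anything needed afterwards has to be copied out, as the count is here.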

The current transcoding support will be updated to improve usability and performance. The primary changes will be:

- to allow transcoding directly from one encoding to another without having to triangulate through an intermediate scalar value
- to add the ability to transcode an input collection in reverse, allowing the different views on String to be made bi-directional
- to have decoding take a collection rather than an iterator, and return an index of its progress into the source, allowing that method to be static

The standard library currently lacks a Latin1 codec, so an enum Latin1: UnicodeEncoding type will be added.
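For context, the sketch below shows the existing iterator-based transcode function from the standard library, followed by what decoding Latin-1 amounts to. The decodeLatin1 function is a hypothetical helper written for this illustration, not the pitched Latin1 type; it relies on the fact that every Latin-1 byte maps to the Unicode scalar with the same numeric value.

```swift
// Existing API: transcode UTF-8 code units to UTF-16 via an iterator.
let utf8 = Array("caf\u{E9}".utf8)
var utf16: [UInt16] = []
let sawError = transcode(
    utf8.makeIterator(),
    from: UTF8.self,
    to: UTF16.self,
    stoppingOnError: false
) { utf16.append($0) }

/// Hypothetical helper, not a standard library API: decodes ISO 8859-1
/// bytes by mapping each byte to the scalar with the same value.
func decodeLatin1<Bytes: Sequence>(_ bytes: Bytes) -> String
where Bytes.Element == UInt8 {
    var scalars = String.UnicodeScalarView()
    for byte in bytes {
        scalars.append(Unicode.Scalar(byte))
    }
    return String(scalars)
}

let latin1Bytes: [UInt8] = [0x63, 0x61, 0x66, 0xE9]
print(sawError, utf16, decodeLatin1(latin1Bytes))
```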

Detailed design

The following additions will be made to the standard library:

protocol Unicode: BidirectionalCollection {
  // Implementation detail as described above
}

extension String: Unicode, RangeReplaceableCollection {
  typealias SubSequence = Substring
}

struct Substring: Unicode, RangeReplaceableCollection {
  typealias SubSequence = Substring
  // near-identical API surface area to String
}
The subscript operations on String will be amended to return Substring:

struct String {
  subscript(bounds: Range<String.Index>) -> Substring { get }
  subscript(bounds: ClosedRange<String.Index>) -> Substring { get }
}
Note that properties or methods that due to their nature create new String storage (such as lowercased()) will not change.
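A small sketch of the resulting types, checked against current Swift toolchains (the sample text is arbitrary):

```swift
let sentence = "Hello, World"
let comma = sentence.firstIndex(of: ",")!

// Range subscripts produce a Substring that shares storage with `sentence`.
let head = sentence[..<comma]

// Operations that must build new storage still produce a String.
let lowered = head.lowercased()

print(type(of: head), type(of: lowered))
```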

C string interop will be consolidated on the following methods:

extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  /// bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)
  
  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  /// the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  /// should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)
    
  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}
Additionally, the current ability to pass a Swift String into C methods that take a C string will remain as-is.

A new protocol, UnicodeEncoding, will be added to replace the current UnicodeCodec protocol:

public enum UnicodeParseResult<T, Index> {
  /// Indicates valid input was recognized.
  ///
  /// `resumptionPoint` is the end of the parsed region
  case valid(T, resumptionPoint: Index) // FIXME: should these be reordered?

  /// Indicates invalid input was recognized.
  ///
  /// `resumptionPoint` is the next position at which to continue parsing after
  /// the invalid input is repaired.
  case error(resumptionPoint: Index)

  /// Indicates that there was no more input to consume.
  case emptyInput

  /// If any input was consumed, the point from which to continue parsing.
  var resumptionPoint: Index? {
    switch self {
    case .valid(_,let r): return r
    case .error(let r): return r
    case .emptyInput: return nil
    }
  }
}

/// An encoding for text with UnicodeScalar as a common currency type
public protocol UnicodeEncoding {
  /// The maximum number of code units in an encoded unicode scalar value
  static var maxLengthOfEncodedScalar: Int { get }
  
  /// A type that can represent a single UnicodeScalar as it is encoded in this
  /// encoding.
  associatedtype EncodedScalar : EncodedScalarProtocol

  /// Produces a scalar of this encoding if possible; returns `nil` otherwise.
  static func encode<Scalar: EncodedScalarProtocol>(
    _:Scalar) -> Self.EncodedScalar?
  
  /// Parse a single unicode scalar forward from `input`.
  ///
  /// - Parameter knownCount: a number of code units known to exist in `input`.
  /// **Note:** passing a known compile-time constant is strongly advised,
  /// even if it's zero.
  static func parseScalarForward<C: Collection>(
    _ input: C, knownCount: Int /* = 0, via extension */
  ) -> UnicodeParseResult<EncodedScalar, C.Index>
  where C.Iterator.Element == EncodedScalar.Iterator.Element

  /// Parse a single unicode scalar in reverse from `input`.
  ///
  /// - Parameter knownCount: a number of code units known to exist in `input`.
  /// **Note:** passing a known compile-time constant is strongly advised,
  /// even if it's zero.
  static func parseScalarReverse<C: BidirectionalCollection>(
    _ input: C, knownCount: Int /* = 0 , via extension */
  ) -> UnicodeParseResult<EncodedScalar, C.Index>
  where C.Iterator.Element == EncodedScalar.Iterator.Element
}

/// Parsing multiple unicode scalar values
extension UnicodeEncoding {
  @discardableResult
  public static func parseForward<C: Collection>(
    _ input: C,
    repairingIllFormedSequences makeRepairs: Bool = true,
    into output: (EncodedScalar) throws->Void
  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
  
  @discardableResult
  public static func parseReverse<C: BidirectionalCollection>(
    _ input: C,
    repairingIllFormedSequences makeRepairs: Bool = true,
    into output: (EncodedScalar) throws->Void
  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
  where C.SubSequence : BidirectionalCollection,
        C.SubSequence.SubSequence == C.SubSequence,
        C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
}
UnicodeCodec will be updated to refine UnicodeEncoding, and all existing codecs will conform to it.

Note, depending on whether this change lands before or after some of the generics features, generic where clauses may need to be added temporarily.
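To illustrate how a result type of this shape could drive a repairing decode loop, here is a self-contained toy for 7-bit ASCII. Everything in it is hypothetical: ParseResult is a local copy of the pitched enum's cases, and parseASCIIScalarForward and decodeASCII are names invented for this sketch rather than part of the proposal.

```swift
/// Local stand-in mirroring the cases of the pitched result type.
enum ParseResult<T, Index> {
    case valid(T, resumptionPoint: Index)
    case error(resumptionPoint: Index)
    case emptyInput
}

/// Hypothetical single-scalar parser for 7-bit ASCII code units.
func parseASCIIScalarForward<C: Collection>(
    _ input: C
) -> ParseResult<Unicode.Scalar, C.Index> where C.Element == UInt8 {
    guard let first = input.first else { return .emptyInput }
    let next = input.index(after: input.startIndex)
    if first < 0x80 {
        return .valid(Unicode.Scalar(first), resumptionPoint: next)
    }
    return .error(resumptionPoint: next)
}

/// Hypothetical driver: decodes all input, substituting U+FFFD for each
/// invalid code unit and counting the repairs.
func decodeASCII(_ bytes: [UInt8]) -> (text: String, errorCount: Int) {
    let replacement: Unicode.Scalar = "\u{FFFD}"
    var scalars = String.UnicodeScalarView()
    var errorCount = 0
    var remainder = bytes[...]
    while true {
        switch parseASCIIScalarForward(remainder) {
        case .valid(let scalar, let next):
            scalars.append(scalar)
            remainder = remainder[next...]
        case .error(let next):
            scalars.append(replacement)
            errorCount += 1
            remainder = remainder[next...]
        case .emptyInput:
            return (String(scalars), errorCount)
        }
    }
}

let decoded = decodeASCII([0x48, 0x69, 0xFF, 0x21])
print(decoded.text, decoded.errorCount)
```

Because the parser takes a collection and hands back an index, it holds no state of its own, which is what lets it be a free (or static) function.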

Source compatibility

Adding collection conformance to String should not materially impact source stability as it is purely additive: Swift 3’s String interface currently fulfills all of the requirements for a bidirectional range replaceable collection.

Altering String’s slicing operations to return a different type is source breaking. The following mitigating steps are proposed:

Add a deprecated subscript operator that will run in Swift 3 compatibility mode and which will return a String not a Substring.

Add deprecated versions of all current slicing methods to similarly return a String.

i.e.:

extension String {
  @available(swift, obsoleted: 4)
  subscript(bounds: Range<Index>) -> String {
    return String(characters[bounds])
  }

  @available(swift, obsoleted: 4)
  subscript(bounds: ClosedRange<Index>) -> String {
    return String(characters[bounds])
  }
}
In a review of 77 popular Swift projects found on GitHub, these changes resolved any build issues in the 12 projects that assumed an explicit String type returned from slicing operations.

Because of the change in internal implementation, these operations will be O(n) rather than O(1). This is not expected to be a major concern, based on experience with a similar change made to Java. Projects will still be able to work around performance issues without upgrading to Swift 4 by explicitly typing slices as Substring, which will call the Swift 4 variant; that variant will be available, but not invoked by default, in Swift 3 mode.

The C string interoperability methods outside the ones described in the detailed design will remain in Swift 3 mode, be deprecated in Swift 4 mode, and be removed in a subsequent release. UnicodeCodec will be similarly deprecated.

Effect on ABI stability

As a fundamental currency type for Swift, it is essential that the String type (and its associated subsequence) is in a good long-term state before being locked down when Swift declares ABI stability. Shrinking the size of String to be 64 bits is an important part of this.

Effect on API resilience

Decisions about the API resilience of the String type are still to be determined, but are not adversely affected by this proposal.

Alternatives considered

For a more in-depth discussion of some of the trade-offs in string design, see the manifesto and associated evolution thread <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170116/thread.html#30497>.

This proposal does not yet introduce an implicit conversion from Substring to String. The decision on whether to add this will be deferred pending feedback on the initial implementation. The intention is to make a preview toolchain available for feedback, including on whether this implicit conversion is necessary, prior to the release of Swift 4.

Several of the types related to String, such as the encodings, would ideally reside inside a namespace rather than live at the top level of the standard library. The best namespace for this is probably Unicode, but this is also the name of the protocol. At some point, if we gain the ability to nest enums and types inside protocols, they should be moved there. Putting them inside String or some other enum namespace is probably not worthwhile in the meantime.


(Xiaodi Wu) #2

This looks great. The restored conformances to *Collection will be huge.

Is this to be the first of several or the only major part of the manifesto
to be delivered in Swift 4?

Nits on naming: are we calling it Substring or SubString (à la
SubSequence)? and shouldn't it be UnicodeParsedResult rather than
UnicodeParseResult?


On Wed, Mar 29, 2017 at 19:32 Ben Cohen via swift-evolution < swift-evolution@swift.org> wrote:



(Brent Royal-Gordon) #3

Hi Swift Evolution,

Below is a pitch for the first part of the String revision. This covers a number of changes that would allow the basic internals to be overhauled.

Online version here: https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md

Really great stuff, guys. Thanks for your work on this!

In order to be able to write extensions across both String and Substring, a new Unicode protocol to which the two types will conform will be introduced. For the purposes of this proposal, Unicode will be defined as a protocol to be used whenever you would previously extend String. It should be possible to substitute extension Unicode { ... } in Swift 4 wherever extension String { ... } was written in Swift 3, with one exception: any passing of self into an API that takes a concrete String will need to be rewritten as String(self). If Self is a String then this should effectively optimize to a no-op, whereas if Self is a Substring then this will force a copy, helping to avoid the “memory leak” problems described above.

I continue to feel that `Unicode` is the wrong name for this protocol, essentially because it sounds like a protocol for, say, a version of Unicode or some kind of encoding machinery instead of a Unicode string. I won't rehash that argument since I made it already in the manifesto thread, but I would like to make a couple new suggestions in this area.

Later on, you note that it would be nice to namespace many of these types:

Several of the types related to String, such as the encodings, would ideally reside inside a namespace rather than live at the top level of the standard library. The best namespace for this is probably Unicode, but this is also the name of the protocol. At some point, if we gain the ability to nest enums and types inside protocols, they should be moved there. Putting them inside String or some other enum namespace is probably not worthwhile in the meantime.

Perhaps we should use an empty enum to create a `Unicode` namespace and then nest the protocol within it via typealias. If we do that, we can consider names like `Unicode.Collection` or even `Unicode.String` which would shadow existing types if they were top-level.
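Roughly, the empty-enum idea could look like the sketch below. All names are hypothetical: _UnicodeText stands in for the pitched protocol, and the namespace is called UnicodeNamespace rather than Unicode so the example does not clash with anything a toolchain already declares.

```swift
/// Hypothetical stand-in for the pitched protocol.
protocol _UnicodeText: BidirectionalCollection where Element == Character {}

/// Hypothetical caseless enum used purely as a namespace.
enum UnicodeNamespace {
    /// Exposes the top-level protocol under a nested name.
    typealias Text = _UnicodeText
}

extension String: UnicodeNamespace.Text {}
extension Substring: UnicodeNamespace.Text {}

/// Generic code can then constrain on the nested name.
func characterCount<T: UnicodeNamespace.Text>(_ text: T) -> Int {
    return text.count
}

print(characterCount("hello"), characterCount("hello".dropFirst()))
```

The protocol itself still has to live at the top level under some other name; the typealias only gives it a namespaced spelling.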

If not, then given this:

The exact nature of the protocol, such as which methods should be protocol requirements vs. which can be implemented as protocol extensions, is considered an implementation detail and so is not covered in this proposal.

We may simply want to wait to choose a name. As the protocol develops, we may discover a theme in its requirements which would suggest a good name. For instance, we may realize that the core of what the protocol abstracts is grouping code units into characters, which might suggest a name like `Characters`, or `Unicode.Characters`, or `CharacterCollection`, or what-have-you.

(By the way, I hope that the eventual protocol requirements will be put through the review process, if only as an amendment, once they're determined.)

Unicode will conform to BidirectionalCollection. RangeReplaceableCollection conformance will be added directly onto the String and Substring types, as it is possible future Unicode-conforming types might not be range-replaceable (e.g. an immutable type that wraps a const char *).

I'm a little worried about this because it seems to imply that the protocol cannot include any mutation operations that aren't in `RangeReplaceableCollection`. For instance, it won't be possible to include an in-place `applyTransform` method in the protocol. Do you anticipate that being an issue? Might it be a good idea to define a parallel `Mutable` or `RangeReplaceable` protocol?
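One possible shape for such a parallel protocol is sketched below. The names `UnicodeText` and `MutableUnicodeText`, and the closure-based signature of `applyTransform`, are assumptions made for illustration; the proposal defines none of them.

```swift
// Hypothetical read-only protocol, standing in for the proposed `Unicode`.
protocol UnicodeText: BidirectionalCollection where Element == Character {}

// Refines the read-only protocol with mutation, so immutable conformers
// (say, a wrapper around a const char *) can adopt UnicodeText alone.
protocol MutableUnicodeText: UnicodeText, RangeReplaceableCollection {
    /// Rewrites the text in place, character by character.
    mutating func applyTransform(_ transform: (Character) -> Character)
}

extension MutableUnicodeText {
    // Default implementation built only on RangeReplaceableCollection;
    // conforming types could override it with something more efficient.
    mutating func applyTransform(_ transform: (Character) -> Character) {
        replaceSubrange(startIndex..<endIndex, with: map(transform))
    }
}

extension String: MutableUnicodeText {}
extension Substring: MutableUnicodeText {}
```

Because `applyTransform` is a requirement of the refined protocol, generic code constrained to `MutableUnicodeText` can mutate in place while code constrained to `UnicodeText` stays read-only.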

The C string interop methods will be updated to those described here: a single withCString operation and two init(cString:) constructors, one for UTF8 and one for arbitrary encodings.

Sorry if I'm repeating something that was already discussed, but is there a reason you don't include a `withCString` variant for arbitrary encodings? It seems like an odd asymmetry.

The standard library currently lacks a Latin1 codec, so an enum Latin1: UnicodeEncoding type will be added.

Nice. I wrote one of those once; I'll enjoy deleting it.

A new protocol, UnicodeEncoding, will be added to replace the current UnicodeCodec protocol:

public enum UnicodeParseResult<T, Index> {

Either `T` should be given a more specific name, or the enum should be given a less specific one, becoming `ParseResult` and being oriented towards incremental parsing of anything from any kind of collection.
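As an illustration of the second option, here is a sketch of a general-purpose `ParseResult` with the same three cases as the proposal's enum, used by a deliberately non-Unicode parser. The parameter name `Output` and the `parseDigit` function are invented for this example.

```swift
// A general-purpose variant of the proposal's UnicodeParseResult.
// `Output` replaces the original `T`; the name is illustrative.
enum ParseResult<Output, Index> {
    case valid(Output, resumptionPoint: Index)
    case error(resumptionPoint: Index)
    case emptyInput

    /// If any input was consumed, the point from which to continue parsing.
    var resumptionPoint: Index? {
        switch self {
        case .valid(_, let r): return r
        case .error(let r): return r
        case .emptyInput: return nil
        }
    }
}

// Hypothetical non-Unicode client: parse one ASCII decimal digit.
func parseDigit<C: Collection>(_ input: C) -> ParseResult<Int, C.Index>
    where C.Element == UInt8 {
    guard let byte = input.first else { return .emptyInput }
    let next = input.index(after: input.startIndex)
    if byte >= 48 && byte <= 57 {   // ASCII "0" through "9"
        return .valid(Int(byte) - 48, resumptionPoint: next)
    }
    return .error(resumptionPoint: next)
}
```

Nothing in the enum itself refers to Unicode, which is the argument for either renaming it or making `T` more specific.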

/// Indicates valid input was recognized.
///
/// `resumptionPoint` is the end of the parsed region
case valid(T, resumptionPoint: Index) // FIXME: should these be reordered?

No, I think this is the right order. The thing that's valid is the code point.

/// Indicates invalid input was recognized.
///
/// `resumptionPoint` is the next position at which to continue parsing after
/// the invalid input is repaired.
case error(resumptionPoint: Index)

I know this is abbreviated documentation, but I hope the full version includes a good usage example demonstrating, among other things, how to detect partial characters and defer processing of them instead of rejecting them as erroneous.

/// An encoding for text with UnicodeScalar as a common currency type
public protocol UnicodeEncoding {
  /// The maximum number of code units in an encoded unicode scalar value
  static var maxLengthOfEncodedScalar: Int { get }
  
  /// A type that can represent a single UnicodeScalar as it is encoded in this
  /// encoding.
  associatedtype EncodedScalar : EncodedScalarProtocol

There's an `EncodedScalarProtocol`-shaped hole in this proposal. What does it do? What are its semantics? How does `EncodedScalar` relate to the old `CodeUnit`?

  @discardableResult
  public static func parseForward<C: Collection>(
    _ input: C,
    repairingIllFormedSequences makeRepairs: Bool = true,
    into output: (EncodedScalar) throws->Void
  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
  
  @discardableResult
  public static func parseReverse<C: BidirectionalCollection>(
    _ input: C,
    repairingIllFormedSequences makeRepairs: Bool = true,
    into output: (EncodedScalar) throws->Void
  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
  where C.SubSequence : BidirectionalCollection,
        C.SubSequence.SubSequence == C.SubSequence,
        C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
}

Are there constraints missing on `parseForward`?

What do these do if `makeRepairs` is false? Would it be clearer if we made an enum that described the behaviors and changed the label to something like `ifIllFormed:`?
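To show what the enum-based spelling might look like at a call site, here is a sketch built on the existing `UTF8` codec rather than on the proposed `parseForward`. The `IllFormedPolicy` enum, its case names, and the `decodeUTF8` function are all hypothetical, and "stop at the first error" is only one guess at what `makeRepairs: false` could mean.

```swift
// Hypothetical replacement for the `makeRepairs: Bool` parameter.
enum IllFormedPolicy {
    case repair   // substitute U+FFFD and keep going
    case stop     // stop at the first ill-formed sequence
}

func decodeUTF8(_ bytes: [UInt8], ifIllFormed policy: IllFormedPolicy = .repair)
    -> (scalars: [UnicodeScalar], errorCount: Int) {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    var scalars: [UnicodeScalar] = []
    var errorCount = 0
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            scalars.append(scalar)
        case .emptyInput:
            return (scalars, errorCount)
        case .error:
            errorCount += 1
            switch policy {
            case .repair: scalars.append("\u{FFFD}")
            case .stop: return (scalars, errorCount)
            }
        }
    }
}
```

A call such as `decodeUTF8(bytes, ifIllFormed: .stop)` states the behavior at the call site, where `repairingIllFormedSequences: false` leaves the reader to guess what happens instead of a repair.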

Due to the change in internal implementation, this means that these operations will be O(n) rather than O(1). This is not expected to be a major concern, based on experiences from a similar change made to Java, but projects will be able to work around performance issues without upgrading to Swift 4 by explicitly typing slices as Substring, which will call the Swift 4 variant, and which will be available but not invoked by default in Swift 3 mode.

Will there be a way to make this also work with a real Swift 3 compiler? For instance, can you define `typealias Substring = String` in such a way that real Swift 3 will parse and use it, but Swift 4 in Swift 3 mode will ignore it?
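One way this could work is conditional compilation on the language version. The sketch below rests on an assumption that the proposal itself does not state: that the Swift 4 compiler reports a language version of at least 3.2 to `#if swift` even in Swift 3 mode, while a real Swift 3 compiler reports 3.0 or 3.1.

```swift
// Only a real Swift 3 compiler takes this branch (see the assumption above).
#if !swift(>=3.2)
typealias Substring = String
#endif

let text = "hello world"
let end = text.index(text.startIndex, offsetBy: 5)

// Resolves to String on a real Swift 3 compiler and to the standard
// library's Substring everywhere else.
let slice: Substring = text[text.startIndex..<end]
```

Code written this way names `Substring` explicitly in both worlds, so the slice keeps its original cost wherever the real type exists.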

This proposal does not yet introduce an implicit conversion from Substring to String. The decision on whether to add this will be deferred pending feedback on the initial implementation. The intention is to make a preview toolchain available for feedback, including on whether this implicit conversion is necessary, prior to the release of Swift 4.

This is a sensible approach.

Thank you for developing this into a full proposal. I discussed the plans for Swift 4 with a local group of programmers recently, and everyone was pleased to hear that `String` would get an overhaul, that the `characters` view would be integrated into the string, etc. We even talked a little about `Substring` and people thought it was a good idea. This proposal is shaping up to impact a lot of people, but in a good way!

···

On Mar 29, 2017, at 5:32 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

--
Brent Royal-Gordon
Architechies


(Zachary Waldowski) #4

Loving it so far.

`encode` and `parseScalar[Forward|Reverse]` feel asymmetric. What's
wrong with `decode[Forward|Backward]`?

`UnicodeParseResult<T, Index>` really feels like it could/should be
defined as `UnicodeEncoding.ParseResult<Index>` (or `DecodeResult`,
given the above). I can't remember if that generics limitation was
being lifted?

Best,

  Zachary Waldowski

  zach@waldowski.me

···

On Wed, Mar 29, 2017, at 08:32 PM, Ben Cohen via swift-evolution wrote:


(Adrian Zubarev) #5

I haven’t followed the topic, and while reading the proposal I found it a little confusing that we have inconsistent type names. I’m not a native English speaker, so that might be the main cause of my confusion here, and I’d appreciate any clarification. :wink:

SubSequence vs. Substring and not SubString.

The word substring is an English word, but so is subsequence (I double checked here).

So where exactly is the issue here? Is it SubSequence which is written in camel case or is it Substring which is not?

···

--
Adrian Zubarev
Sent with Airmail

On 30 March 2017 at 02:32:39, Ben Cohen via swift-evolution (swift-evolution@swift.org) wrote:

Hi Swift Evolution,

Below is a pitch for the first part of the String revision. This covers a number of changes that would allow the basic internals to be overhauled.

Online version here: https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md

String Revision: Collection Conformance, C Interop, Transcoding

Proposal: SE-0161
Authors: Ben Cohen, Dave Abrahams
Review Manager: TBD
Status: Awaiting review
Introduction

This proposal is to implement a subset of the changes from the Swift 4 String Manifesto.

Specifically:

Make String conform to BidirectionalCollection
Make String conform to RangeReplaceableCollection
Create a Substring type for String.SubSequence
Create a Unicode protocol to allow for generic operations over both types.
Consolidate on a concise set of C interop methods.
Revise the transcoding infrastructure.
Other existing aspects of String remain unchanged for the purposes of this proposal.

Motivation

This proposal follows up on a number of recommendations found in the manifesto:

Collection conformance was dropped from String in Swift 2. After reevaluation, the feeling is that the minor semantic discrepancies (mainly with RangeReplaceableCollection) are outweighed by the significant benefits of restoring these conformances. For more detail on the reasoning, see here

While it is not a collection, the Swift 3 string does have slicing operations. String is currently serving as its own subsequence, allowing substrings to share storage with their “owner”. This can lead to memory leaks when small substrings of larger strings are stored long-term (see here for more detail on this problem). Introducing a separate type of Substring to serve as String.Subsequence is recommended to resolve this issue, in a similar fashion to ArraySlice.

As noted in the manifesto, support for interoperation with nul-terminated C strings in Swift 3 is scattered and incoherent, with 6 ways to transform a C string into a String and four ways to do the inverse. These APIs should be replaced with a simpler set of methods on String.

Proposed solution

A new type, Substring, will be introduced. Similar to ArraySlice it will be documented as only for short- to medium-term storage:

Important

Long-term storage of Substring instances is discouraged. A substring holds a reference to the entire storage of a larger string, not just to the portion it presents, even after the original string’s lifetime ends. Long-term storage of a substring may therefore prolong the lifetime of elements that are no longer otherwise accessible, which can appear to be memory leakage.
Aside from minor differences, such as having a SubSequence of Self and a larger size to describe the range of the subsequence, Substring will be near-identical from a user perspective.

In order to be able to write extensions accross both String and Substring, a new Unicode protocol to which the two types will conform will be introduced. For the purposes of this proposal, Unicode will be defined as a protocol to be used whenver you would previously extend String. It should be possible to substitute extension Unicode { ... } in Swift 4 wherever extension String { ... } was written in Swift 3, with one exception: any passing of self into an API that takes a concrete String will need to be rewritten as String(self). If Self is a String then this should effectively optimize to a no-op, whereas if Self is a Substring then this will force a copy, helping to avoid the “memory leak” problems described above.

The exact nature of the protocol – such as which methods should be protocol requirements vs which can be implemented as protocol extensions, are considered implementation details and so not covered in this proposal.

Unicode will conform to BidirectionalCollection. RangeReplaceableCollection conformance will be added directly onto the String and Substring types, as it is possible future Unicode-conforming types might not be range-replaceable (e.g. an immutable type that wraps a const char *).

The C string interop methods will be updated to those described here: a single withCString operation and two init(cString:) constructors, one for UTF8 and one for arbitrary encodings. The primary change is to remove “non-repairing” variants of construction from nul-terminated C strings. In both of the construction APIs, any invalid encoding sequence detected will have its longest valid prefix replaced by U+FFFD, the Unicode replacement character, per the Unicode specification. This covers the common case. The replacement is done physically in the underlying storage and the validity of the result is recorded in the String’s encoding such that future accesses need not be slowed down by possible error repair separately. Construction that is aborted when encoding errors are detected can be accomplished using APIs on the encoding.

The current transcoding support will be updated to improve usability and performance. The primary changes will be:

to allow transcoding directly from one encoding to another without having to triangulate through an intermediate scalar value
to add the ability to transcode an input collection in reverse, allowing the different views on String to be made bi-directional
to have decoding take a collection rather than an iterator, and return an index of its progress into the source, allowing that method to be static
The standard library currently lacks a Latin1 codec, so a enum Latin1: UnicodeEncoding type will be added.

Detailed design

The following additions will be made to the standard library:

protocol Unicode: BidirectionalCollection {
  // Implementation detail as described above
}

extension String: Unicode, RangeReplaceableCollection {
  typealias SubSequence = Substring
}

struct Substring: Unicode, RangeReplaceableCollection {
  typealias SubSequence = Substring
  // near-identical API surface area to String
}
The subscript operations on String will be amended to return Substring:

struct String {
  subscript(bounds: Range<String.Index>) -> Substring { get }
  subscript(bounds: ClosedRange<String.Index>) -> Substring { get }
}
Note that properties or methods that due to their nature create new String storage (such as lowercased()) will not change.

C string interop will be consolidated on the following methods:

extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  /// bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)
   
  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  /// the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  /// should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)
     
  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}
Additionally, the current ability to pass a Swift String into C methods that take a C string will remain as-is.

A new protocol, UnicodeEncoding, will be added to replace the current UnicodeCodec protocol:

public enum UnicodeParseResult<T, Index> {
/// Indicates valid input was recognized.
///
/// `resumptionPoint` is the end of the parsed region
case valid(T, resumptionPoint: Index) // FIXME: should these be reordered?
/// Indicates invalid input was recognized.
///
/// `resumptionPoint` is the next position at which to continue parsing after
/// the invalid input is repaired.
case error(resumptionPoint: Index)

/// Indicates that there was no more input to consume.
case emptyInput

  /// If any input was consumed, the point from which to continue parsing.
  var resumptionPoint: Index? {
    switch self {
    case .valid(_,let r): return r
    case .error(let r): return r
    case .emptyInput: return nil
    }
  }
}

/// An encoding for text with UnicodeScalar as a common currency type
public protocol UnicodeEncoding {
  /// The maximum number of code units in an encoded unicode scalar value
  static var maxLengthOfEncodedScalar: Int { get }
   
  /// A type that can represent a single UnicodeScalar as it is encoded in this
  /// encoding.
  associatedtype EncodedScalar : EncodedScalarProtocol

  /// Produces a scalar of this encoding if possible; returns `nil` otherwise.
  static func encode<Scalar: EncodedScalarProtocol>(
    _:Scalar) -> Self.EncodedScalar?
   
  /// Parse a single unicode scalar forward from `input`.
  ///
  /// - Parameter knownCount: a number of code units known to exist in `input`.
  /// **Note:** passing a known compile-time constant is strongly advised,
  /// even if it's zero.
  static func parseScalarForward<C: Collection>(
    _ input: C, knownCount: Int /* = 0, via extension */
  ) -> ParseResult<EncodedScalar, C.Index>
  where C.Iterator.Element == EncodedScalar.Iterator.Element

  /// Parse a single unicode scalar in reverse from `input`.
  ///
  /// - Parameter knownCount: a number of code units known to exist in `input`.
  /// **Note:** passing a known compile-time constant is strongly advised,
  /// even if it's zero.
  static func parseScalarReverse<C: BidirectionalCollection>(
    _ input: C, knownCount: Int /* = 0 , via extension */
  ) -> ParseResult<EncodedScalar, C.Index>
  where C.Iterator.Element == EncodedScalar.Iterator.Element
}

/// Parsing multiple unicode scalar values
extension UnicodeEncoding {
  @discardableResult
  public static func parseForward<C: Collection>(
    _ input: C,
    repairingIllFormedSequences makeRepairs: Bool = true,
    into output: (EncodedScalar) throws->Void
  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
   
  @discardableResult
  public static func parseReverse<C: BidirectionalCollection>(
    _ input: C,
    repairingIllFormedSequences makeRepairs: Bool = true,
    into output: (EncodedScalar) throws->Void
  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
  where C.SubSequence : BidirectionalCollection,
        C.SubSequence.SubSequence == C.SubSequence,
        C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
}
UnicodeCodec will be updated to refine UnicodeEncoding, and all existing codecs will conform to it.
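
To make the role of the parse result concrete, here is a minimal sketch, in present-day Swift rather than the pitched API, of a forward parse loop driven by `resumptionPoint`. `SketchParseResult`, `parseASCIIScalarForward`, and `decodeASCII` are hypothetical names invented for illustration; only the three cases and the `resumptionPoint` property mirror the pitch, and the toy ASCII rule stands in for a real encoding.

```swift
// Hypothetical mirror of the pitched result enum (not a standard library type).
enum SketchParseResult<T, Index> {
  case valid(T, resumptionPoint: Index)
  case error(resumptionPoint: Index)
  case emptyInput

  /// If any input was consumed, the point from which to continue parsing.
  var resumptionPoint: Index? {
    switch self {
    case .valid(_, let r): return r
    case .error(let r): return r
    case .emptyInput: return nil
    }
  }
}

// Toy stand-in for `parseScalarForward`: one byte per scalar,
// and any byte >= 0x80 is reported as an error.
func parseASCIIScalarForward<C: Collection>(
  _ input: C
) -> SketchParseResult<UnicodeScalar, C.Index> where C.Element == UInt8 {
  guard let byte = input.first else { return .emptyInput }
  let next = input.index(after: input.startIndex)
  if byte < 0x80 {
    return .valid(UnicodeScalar(byte), resumptionPoint: next)
  }
  return .error(resumptionPoint: next)
}

// Repairing loop in the spirit of `parseForward`: each error becomes U+FFFD.
func decodeASCII(_ bytes: [UInt8]) -> (text: String, errorCount: Int) {
  var scalars = String.UnicodeScalarView()
  var errorCount = 0
  var rest = bytes[...]
  while true {
    let result = parseASCIIScalarForward(rest)
    switch result {
    case .valid(let scalar, _):
      scalars.append(scalar)
    case .error:
      scalars.append("\u{FFFD}")
      errorCount += 1
    case .emptyInput:
      break
    }
    // No resumption point means the input is exhausted.
    guard let next = result.resumptionPoint else { break }
    rest = rest[next...]
  }
  return (String(scalars), errorCount)
}
```

Because the parser is a free function that takes a collection and hands back an index, the caller owns the iteration state, which is what allows the pitched methods to be static.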

Note, depending on whether this change lands before or after some of the generics features, generic where clauses may need to be added temporarily.

Source compatibility

Adding collection conformance to String should not materially impact source stability as it is purely additive: Swift 3’s String interface currently fulfills all of the requirements for a bidirectional range replaceable collection.
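
As a rough illustration of what the restored conformances buy, the sketch below passes a String directly to generic algorithms written against the collection protocols. It assumes a toolchain in which String already conforms to both protocols; `isPalindrome` and `removingAll` are hypothetical helpers, not proposed API.

```swift
// Works for any bidirectional collection of equatable elements.
func isPalindrome<C: BidirectionalCollection>(_ c: C) -> Bool
  where C.Element: Equatable {
  return c.elementsEqual(c.reversed())
}

// Works for any range-replaceable collection; `filter` returns `Self` here.
func removingAll<C: RangeReplaceableCollection>(
  _ element: C.Element, from c: C
) -> C where C.Element: Equatable {
  return c.filter { $0 != element }
}

let word: String = "level"
let greeting: String = "hello"
let isMirror = isPalindrome(word)
let stripped = removingAll(Character("l"), from: greeting)
```

No `.characters` view is needed in either call, because the string itself is the collection.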

Altering String’s slicing operations to return a different type is source breaking. The following mitigating steps are proposed:

Add a deprecated subscript operator that will run in Swift 3 compatibility mode and which will return a String not a Substring.

Add deprecated versions of all current slicing methods to similarly return a String.

i.e.:

extension String {
  @available(swift, obsoleted: 4)
  subscript(bounds: Range<Index>) -> String {
    return String(characters[bounds])
  }

  @available(swift, obsoleted: 4)
  subscript(bounds: ClosedRange<Index>) -> String {
    return String(characters[bounds])
  }
}
In a review of 77 popular Swift projects found on GitHub, these changes resolved any build issues in the 12 projects that assumed an explicit String type returned from slicing operations.

Due to the change in internal implementation, these compatibility operations will be O(n) rather than O(1). This is not expected to be a major concern, based on experience from a similar change made to Java. Projects that do hit performance issues will be able to work around them without upgrading to Swift 4 by explicitly typing slices as Substring. Doing so selects the Swift 4 variant, which will be available but not invoked by default in Swift 3 mode.
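
The trade-off can be sketched as follows, assuming a toolchain in which slicing a String yields a Substring. `splitKeyValue` is a hypothetical helper: the slices share storage with the original line while parsing, and the explicit `String(...)` conversions are where the O(n) copy happens.

```swift
func splitKeyValue(_ line: String) -> (key: String, value: String)? {
  guard let separator = line.firstIndex(of: "=") else { return nil }
  // Slicing is cheap: both slices refer to the storage of `line`.
  let key: Substring = line[..<separator]
  let value: Substring = line[line.index(after: separator)...]
  // Copy into fresh strings before handing the result to long-lived storage.
  return (String(key), String(value))
}
```

Keeping the intermediate values typed as Substring and converting once at the end avoids both repeated copies and accidentally retaining the whole source line.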

The C string interoperability methods outside the ones described in the detailed design will remain in Swift 3 mode, be deprecated in Swift 4 mode, and be removed in a subsequent release. UnicodeCodec will be similarly deprecated.

Effect on ABI stability

As a fundamental currency type for Swift, it is essential that the String type (and its associated subsequence) is in a good long-term state before being locked down when Swift declares ABI stability. Shrinking the size of String to be 64 bits is an important part of this.

Effect on API resilience

Decisions about the API resilience of the String type are still to be determined, but are not adversely affected by this proposal.

Alternatives considered

For a more in-depth discussion of some of the trade-offs in string design, see the manifesto and associated evolution thread.

This proposal does not yet introduce an implicit conversion from Substring to String. The decision on whether to add this will be deferred pending feedback on the initial implementation. The intention is to make a preview toolchain available for feedback, including on whether this implicit conversion is necessary, prior to the release of Swift 4.

Several of the types related to String, such as the encodings, would ideally reside inside a namespace rather than live at the top level of the standard library. The best namespace for this is probably Unicode, but this is also the name of the protocol. At some point, if we gain the ability to nest enums and types inside protocols, they should be moved there. Putting them inside String or some other enum namespace is probably not worthwhile in the meantime.
_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Karl) #6

In order to be able to write extensions across both String and Substring, a new Unicode protocol to which the two types will conform will be introduced. For the purposes of this proposal, Unicode will be defined as a protocol to be used whenever you would previously extend String. It should be possible to substitute extension Unicode { ... } in Swift 4 wherever extension String { ... } was written in Swift 3, with one exception: any passing of self into an API that takes a concrete String will need to be rewritten as String(self). If Self is a String then this should effectively optimize to a no-op, whereas if Self is a Substring then this will force a copy, helping to avoid the “memory leak” problems described above.
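
The substitution described in the quoted paragraph can be sketched like this. The pitched `Unicode` protocol is not available to compile against, so the sketch assumes the `StringProtocol` protocol from shipped Swift, which plays the same role for String and Substring. `shout`, `firstWord`, and `shouted` are hypothetical names.

```swift
// An API that insists on a concrete String.
func shout(_ s: String) -> String {
  return s.uppercased() + "!"
}

extension StringProtocol {
  /// The first whitespace-delimited word, copied into its own String.
  var firstWord: String {
    return String(self.prefix(while: { !$0.isWhitespace }))
  }

  /// Passing `self` to a String-taking API requires the explicit conversion.
  var shouted: String {
    return shout(String(self))
  }
}

let sentence = "hello world"
let tail: Substring = sentence.dropFirst(6)
```

Both `sentence.firstWord` and `tail.shouted` work because the extension is written once against the shared protocol.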

Did you consider an AnyUnicode<Encoding> wrapper? Then we could have a typealias called “AnyString”.

Also, regarding naming: “Unicode” is great if this was a namespace, and this proposal is a great example of why protocol nesting is badly needed in Swift code which defines (not even very complex) protocols. However, absent protocol nesting, I think “UnicodeEncoded” is better. It doesn’t roll off the tongue as nicely, perhaps, but it also doesn’t look as weird when written in code.

The exact nature of the protocol – such as which methods should be protocol requirements vs which can be implemented as protocol extensions, are considered implementation details and so not covered in this proposal.

I’d hope they do get a proposal at some stage, though. There are cases where I’d like to be able to write my own “Unicode” type and take advantage of generic (and existential when we can) text processing.

For example, maybe the thing I want to present as a single block of text is actually pieced together from multiple discontiguous regions of a buffer (i.e. the “buffer-gap” approach for faster random insertions/deletions, if I expect my code to be doing lots of that).

You could imagine that if something like CoreText (can’t speak for them, of course) were being rewritten in Swift, it would be able to compute layouts and render glyphs from any provider of unicode data and not just String or Substring. I mean, that’s my dream, anyway. It would mean you could go directly from a buffer-gap String to a rendered bitmap suitable for UI.
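
A minimal sketch of that idea, under loud assumptions: `GapText` is a hypothetical gap buffer of Characters that presents its two discontiguous regions as one collection, and `wordCount` stands in for generic text processing. It conforms to the standard collection protocols rather than to the pitched Unicode protocol, whose requirements are not specified.

```swift
struct GapText: RandomAccessCollection {
  private var storage: [Character]
  private var gapStart: Int   // first slot of the gap
  private var gapEnd: Int     // first slot after the gap

  init(_ text: String, gapCapacity: Int = 16) {
    storage = Array(text) + [Character](repeating: " ", count: gapCapacity)
    gapStart = text.count
    gapEnd = storage.count
  }

  var startIndex: Int { return 0 }
  var endIndex: Int { return storage.count - (gapEnd - gapStart) }

  subscript(position: Int) -> Character {
    precondition(position >= 0 && position < endIndex, "index out of range")
    // Logical positions past the gap are shifted by the gap's width.
    return position < gapStart
      ? storage[position]
      : storage[position + (gapEnd - gapStart)]
  }

  private mutating func moveGap(to position: Int) {
    while gapStart > position {          // slide the gap left
      gapStart -= 1; gapEnd -= 1
      storage[gapEnd] = storage[gapStart]
    }
    while gapStart < position {          // slide the gap right
      storage[gapStart] = storage[gapEnd]
      gapStart += 1; gapEnd += 1
    }
  }

  mutating func insert(_ character: Character, at position: Int) {
    precondition(position >= 0 && position <= endIndex, "index out of range")
    if gapStart == gapEnd {              // gap exhausted: reopen it in place
      storage.insert(contentsOf: [Character](repeating: " ", count: 16), at: gapEnd)
      gapEnd += 16
    }
    moveGap(to: position)
    storage[gapStart] = character
    gapStart += 1
  }
}

// Generic text processing that never asks for a concrete String.
func wordCount<T: BidirectionalCollection>(_ text: T) -> Int
  where T.Element == Character {
  return text.split(whereSeparator: { $0.isWhitespace }).count
}
```

Repeated insertions near the same position only move the gap once, which is the property that makes this layout attractive for editors.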

Unicode will conform to BidirectionalCollection. RangeReplaceableCollection conformance will be added directly onto the String and Substring types, as it is possible future Unicode-conforming types might not be range-replaceable (e.g. an immutable type that wraps a const char *).

+1. Keep the protocol focussed.

The standard library currently lacks a Latin1 codec, so an enum Latin1: UnicodeEncoding type will be added.

I feel this is a call for better naming somewhere.
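
For readers unfamiliar with why Latin-1 is the simplest possible codec: every byte is exactly the Unicode scalar with the same numeric value, so decoding cannot fail and encoding fails only for scalars above U+00FF. The functions below are a hypothetical sketch of that mapping, not the pitched `Latin1` type.

```swift
// Decoding: each byte maps to the scalar with the same value.
func decodeLatin1(_ bytes: [UInt8]) -> String {
  var scalars = String.UnicodeScalarView()
  for byte in bytes {
    scalars.append(UnicodeScalar(byte))
  }
  return String(scalars)
}

// Encoding: returns nil if any scalar is not representable,
// echoing the pitched `encode`, which returns nil when encoding is impossible.
func encodeLatin1(_ text: String) -> [UInt8]? {
  var bytes: [UInt8] = []
  for scalar in text.unicodeScalars {
    guard scalar.value <= 0xFF else { return nil }
    bytes.append(UInt8(scalar.value))
  }
  return bytes
}
```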

  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)

So will this replace the stuff which Foundation puts into String, which also decodes a C string into a Swift string?

Foundation includes more encodings (and also nests an “Encoding” enum in String itself, which makes things even more confusing), but totally ignores standard library decoders in favour of CF ones.

- Karl


(Ben Rimmington) #7

Re: <https://github.com/apple/swift-evolution/pull/662>

### C String Interop

Will the `init(cString: UnsafePointer<UInt8>)` API be deprecated in Swift 4 mode?

<https://github.com/apple/swift/blob/8f3bc160c56461789261d2555dc5485c0484aad5/stdlib/public/core/CString.swift#L55-L63>

It was added by SE-0107 to avoid unsafe pointer conversions:

<https://github.com/apple/swift-evolution/blob/master/proposals/0107-unsaferawpointer.md#cstring-conversion>
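
For context, here is a small sketch of the C string entry points as I understand them in the shipped standard library, which may differ from the signatures pitched in this thread: `String(cString:)` for both `CChar` and `UInt8` pointers, `String(decodingCString:as:)` for other encodings, and `withCString`. The sample byte sequences are made up for illustration.

```swift
// Round-trip through a NUL-terminated UTF-8 buffer.
let roundTripped = "Hi".withCString { String(cString: $0) }

// Decode NUL-terminated UTF-16 code units with an explicit encoding.
let utf16Units: [UInt16] = [0x0048, 0x0069, 0]
let fromUTF16 = utf16Units.withUnsafeBufferPointer { buffer in
  String(decodingCString: buffer.baseAddress!, as: UTF16.self)
}

// The UInt8 overload accepts unsigned bytes without a pointer cast.
// 0xFF can never appear in well-formed UTF-8, so it is repaired on construction.
let illFormed: [UInt8] = [0x48, 0xFF, 0x69, 0]
let repaired = illFormed.withUnsafeBufferPointer { buffer in
  String(cString: buffer.baseAddress!)
}
```

The last case shows the repairing behaviour the pitch standardizes on: construction succeeds and the ill-formed byte is replaced with U+FFFD rather than causing a failure.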

### C Primitive Types

Will the `CChar` typealias be *unsigned* on some platforms?

<https://github.com/apple/swift/blob/8f3bc160c56461789261d2555dc5485c0484aad5/stdlib/public/core/CTypes.swift#L15-L19>

Could some of those typealiases in CTypes.swift be moved to a C header, so that they're always imported as the correct types for each platform?

//===--- CTypes.h ------===//
typedef char CChar;
typedef signed char CSignedChar;
typedef unsigned char CUnsignedChar;
typedef short CShort;
typedef unsigned short CUnsignedShort;
typedef int CInt;
typedef unsigned int CUnsignedInt;
typedef long CLong;
typedef unsigned long CUnsignedLong;
typedef long long CLongLong;
typedef unsigned long long CUnsignedLongLong;
typedef float CFloat;
typedef double CDouble;

For example, CTypes.swift for 64-bit Windows currently has:
* `CLong = Int32` versus `CUnsignedLong = UInt`,
* `CLongLong = Int` versus `CUnsignedLongLong = UInt64`.
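
One way to write code that does not care whether `CChar` is signed is to convert through `truncatingIfNeeded`, which compiles for either a signed or an unsigned 8-bit `CChar`. The helper below is a hypothetical sketch; it relies only on `withCString` and the generic integer initializer.

```swift
func utf8Bytes(of string: String) -> [UInt8] {
  return string.withCString { pointer in
    var bytes: [UInt8] = []
    var cursor = pointer
    // Walk until the NUL terminator, reinterpreting each CChar as a byte.
    while cursor.pointee != 0 {
      bytes.append(UInt8(truncatingIfNeeded: cursor.pointee))
      cursor += 1
    }
    return bytes
  }
}
```

By contrast, `UInt8(bitPattern:)` would only compile where `CChar` is `Int8`, which is the kind of platform assumption the question above is probing.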


(Ben Cohen) #8

This looks great. The restored conformances to *Collection will be huge.

Is this to be the first of several or the only major part of the manifesto to be delivered in Swift 4?

First of several. This lays the ground work for the changes to the underlying implementation. Other changes will mostly be additive on top.

Nits on naming: are we calling it Substring or SubString (à la SubSequence)?

This is venturing into subjective territory, so these are just my feelings rather than something definitive (Dave may differ) but:

It should definitely be Substring. My rule of thumb: if you might hyphenate it, you can capitalize it. I don’t think anyone spells it "sub-string". OTOH one might write "sub-sequence". Generally hyphens disappear in English as things come into common usage, e.g. it used to be e-mail but now it’s mostly just email. Substring is enough of a term of art in programming that this has happened. Admittedly, Subsequence is a term of art too – unfortunately one that has a different meaning to ours ("a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements" e.g. <A,C,E> is a Subsequence of <A,B,C,D,E> – see https://en.wikipedia.org/wiki/Subsequence). Even worse, the mathematical term for what we are calling a subsequence is a Substring!
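
The distinction can be made concrete with a short sketch. `isMathematicalSubsequence` is a hypothetical helper implementing the mathematical definition quoted above (elements in order, gaps allowed), while the slice at the end shows what Swift calls a SubSequence, which is always contiguous.

```swift
func isMathematicalSubsequence<S: Sequence, T: Sequence>(
  _ candidate: S, of whole: T
) -> Bool where S.Element == T.Element, S.Element: Equatable {
  var iterator = candidate.makeIterator()
  var wanted = iterator.next()
  // Consume `whole` once, advancing through `candidate` on each match.
  for element in whole {
    if let next = wanted, next == element {
      wanted = iterator.next()
    }
  }
  return wanted == nil
}

let letters = "ABCDE"
let mathematical = isMathematicalSubsequence("ACE", of: letters)  // gaps allowed
let contiguous = letters.dropFirst().prefix(3)                    // a Swift slice
```

Here "ACE" qualifies in the mathematical sense even though no slice of the string equals it, whereas the Swift slice covers exactly the adjacent elements B, C, D.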

If we were to change anything, my vote would be to lowercase Subsequence. We can typealias SubSequence = Subsequence to aid migration, with a slow burn on deprecating it since it’ll be quite a footling deprecation. I don’t know if it’s worth it though – the main use of “SubSequence” is currently in those pesky where clauses you have to put on all your Collection extensions if you want to use slicing, and many of these will be eliminated once we have the ability to put where clauses on associated types.

and shouldn't it be UnicodeParsedResult rather than UnicodeParseResult?

I think Parse. As in, this is the result of a parse, not these are the parsed results (though it does contain parsed results in some cases, but not all).

···

On Mar 29, 2017, at 6:59 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:



(Joshua Alvarado) #9

Restoring Collection conformance to String is a big win for Swift!
This revision looks great, but I agree on the naming: I believe it should
be SubString, not Substring. SubString looks odd written out compared with
Substring, but it keeps the convention of SubSequence.

···

On Wed, Mar 29, 2017 at 7:59 PM, Xiaodi Wu via swift-evolution < swift-evolution@swift.org> wrote:

This looks great. The restored conformances to *Collection will be huge.

Is this to be the first of several or the only major part of the manifesto
to be delivered in Swift 4?

Nits on naming: are we calling it Substring or SubString (à la
SubSequence)? and shouldn't it be UnicodeParsedResult rather than
UnicodeParseResult?

On Wed, Mar 29, 2017 at 19:32 Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

Hi Swift Evolution,

Below is a pitch for the first part of the String revision. This covers a
number of changes that would allow the basic internals to be overhauled.

Online version here: https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md

String Revision: Collection Conformance, C Interop, Transcoding

   - Proposal: SE-0161
   - Authors: Ben Cohen <https://github.com/airspeedswift>, Dave Abrahams
   <http://github.com/dabrahams/>
   - Review Manager: TBD
   - Status: *Awaiting review*

Introduction

This proposal is to implement a subset of the changes from the Swift 4
String Manifesto
<https://github.com/apple/swift/blob/master/docs/StringManifesto.md>.

Specifically:

   - Make String conform to BidirectionalCollection
   - Make String conform to RangeReplaceableCollection
   - Create a Substring type for String.SubSequence
   - Create a Unicode protocol to allow for generic operations over both
   types.
   - Consolidate on a concise set of C interop methods.
   - Revise the transcoding infrastructure.

Other existing aspects of String remain unchanged for the purposes of
this proposal.
Motivation

This proposal follows up on a number of recommendations found in the
manifesto:

Collection conformance was dropped from String in Swift 2. After
reevaluation, the feeling is that the minor semantic discrepancies (mainly
with RangeReplaceableCollection) are outweighed by the significant
benefits of restoring these conformances. For more detail on the reasoning,
see here
<https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again>

While it is not a collection, the Swift 3 string does have slicing
operations. String is currently serving as its own subsequence, allowing
substrings to share storage with their “owner”. This can lead to memory
leaks when small substrings of larger strings are stored long-term (see
here
<https://github.com/apple/swift/blob/master/docs/StringManifesto.md#substrings> for
more detail on this problem). Introducing a separate type of Substring to
serve as String.Subsequence is recommended to resolve this issue, in a
similar fashion to ArraySlice.

As noted in the manifesto, support for interoperation with nul-terminated
C strings in Swift 3 is scattered and incoherent, with 6 ways to transform
a C string into a String and four ways to do the inverse. These APIs
should be replaced with a simpler set of methods on String.
Proposed solution

A new type, Substring, will be introduced. Similar to ArraySlice it will
be documented as only for short- to medium-term storage:

*Important*
Long-term storage of Substring instances is discouraged. A substring
holds a reference to the entire storage of a larger string, not just to the
portion it presents, even after the original string’s lifetime ends.
Long-term storage of a substring may therefore prolong the lifetime of
elements that are no longer otherwise accessible, which can appear to be
memory leakage.

Aside from minor differences, such as having a SubSequence of Self and a
larger size to describe the range of the subsequence, Substring will be
near-identical from a user perspective.

In order to be able to write extensions across both String and Substring,
a new Unicode protocol to which the two types will conform will be
introduced. For the purposes of this proposal, Unicode will be defined as
a protocol to be used whenever you would previously extend String. It
should be possible to substitute extension Unicode { ... } in Swift 4
wherever extension String { ... } was written in Swift 3, with one
exception: any passing of self into an API that takes a concrete String will
need to be rewritten as String(self). If Self is a String then this
should effectively optimize to a no-op, whereas if Self is a Substring then
this will force a copy, helping to avoid the “memory leak” problems
described above.

The exact nature of the protocol – such as which methods should be
protocol requirements vs which can be implemented as protocol extensions,
are considered implementation details and so not covered in this proposal.

Unicode will conform to BidirectionalCollection.
RangeReplaceableCollection conformance will be added directly onto the
String and Substring types, as it is possible future Unicode-conforming
types might not be range-replaceable (e.g. an immutable type that wraps a const
char *).

The C string interop methods will be updated to those described here
<https://github.com/apple/swift/blob/master/docs/StringManifesto.md#c-string-interop>:
a single withCString operation and two init(cString:) constructors, one
for UTF8 and one for arbitrary encodings. The primary change is to remove
“non-repairing” variants of construction from nul-terminated C strings. In
both of the construction APIs, any invalid encoding sequence detected will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per the Unicode specification. This covers the common case. The
replacement is done physically in the underlying storage and the validity
of the result is recorded in the String’s encoding such that future
accesses need not be slowed down by possible error repair separately.
Construction that is aborted when encoding errors are detected can be
accomplished using APIs on the encoding.

The current transcoding support will be updated to improve usability and
performance. The primary changes will be:

   - to allow transcoding directly from one encoding to another without
   having to triangulate through an intermediate scalar value
   - to add the ability to transcode an input collection in reverse,
   allowing the different views on String to be made bi-directional
   - to have decoding take a collection rather than an iterator, and
   return an index of its progress into the source, allowing that method to be
   static

The standard library currently lacks a Latin1 codec, so an enum Latin1:
UnicodeEncoding type will be added.

Detailed design

The following additions will be made to the standard library:

protocol Unicode: BidirectionalCollection {
  // Implementation detail as described above
}
extension String: Unicode, RangeReplaceableCollection {
  typealias SubSequence = Substring
}
struct Substring: Unicode, RangeReplaceableCollection {
  typealias SubSequence = Substring
  // near-identical API surface area to String
}

The subscript operations on String will be amended to return Substring:

struct String {
  subscript(bounds: Range<String.Index>) -> Substring { get }
  subscript(bounds: ClosedRange<String.Index>) -> Substring { get }
}

Note that properties or methods that due to their nature create new String storage
(such as lowercased()) will *not* change.
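
To illustrate the difference, here is a short sketch written against the slicing behaviour this section proposes. The variable names and sample text are illustrative only, and `firstIndex(of:)` is assumed to be available (Swift 4.2 or later).

```swift
let line = "Content-Type: text/html"
let colon = line.firstIndex(of: ":") ?? line.endIndex

// Slicing yields a Substring that shares storage with `line`.
let name: Substring = line[line.startIndex..<colon]

// lowercased() has to build new storage, so it returns a String.
let lowered: String = name.lowercased()

// Copy explicitly before storing a slice long-term, so the small
// slice does not keep the whole original string alive.
let stored = String(name)

print(name, lowered, stored)   // Content-Type content-type Content-Type
```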

C string interop will be consolidated on the following methods:

extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  /// bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  /// the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  /// should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}

Additionally, the current ability to pass a Swift String into C methods
that take a C string will remain as-is.

A new protocol, UnicodeEncoding, will be added to replace the current
UnicodeCodec protocol:

public enum UnicodeParseResult<T, Index> {
  /// Indicates valid input was recognized.
  ///
  /// `resumptionPoint` is the end of the parsed region
  case valid(T, resumptionPoint: Index) // FIXME: should these be reordered?

  /// Indicates invalid input was recognized.
  ///
  /// `resumptionPoint` is the next position at which to continue parsing after
  /// the invalid input is repaired.
  case error(resumptionPoint: Index)

  /// Indicates that there was no more input to consume.
  case emptyInput

  /// If any input was consumed, the point from which to continue parsing.
  var resumptionPoint: Index? {
    switch self {
    case .valid(_,let r): return r
    case .error(let r): return r
    case .emptyInput: return nil
    }
  }
}

/// An encoding for text with UnicodeScalar as a common currency type
public protocol UnicodeEncoding {
  /// The maximum number of code units in an encoded unicode scalar value
  static var maxLengthOfEncodedScalar: Int { get }

  /// A type that can represent a single UnicodeScalar as it is encoded in this
  /// encoding.
  associatedtype EncodedScalar : EncodedScalarProtocol

  /// Produces a scalar of this encoding if possible; returns `nil` otherwise.
  static func encode<Scalar: EncodedScalarProtocol>(
    _:Scalar) -> Self.EncodedScalar?

  /// Parse a single unicode scalar forward from `input`.
  ///
  /// - Parameter knownCount: a number of code units known to exist in `input`.
  /// **Note:** passing a known compile-time constant is strongly advised,
  /// even if it's zero.
  static func parseScalarForward<C: Collection>(
    _ input: C, knownCount: Int /* = 0, via extension */
  ) -> ParseResult<EncodedScalar, C.Index>
  where C.Iterator.Element == EncodedScalar.Iterator.Element

  /// Parse a single unicode scalar in reverse from `input`.
  ///
  /// - Parameter knownCount: a number of code units known to exist in `input`.
  /// **Note:** passing a known compile-time constant is strongly advised,
  /// even if it's zero.
  static func parseScalarReverse<C: BidirectionalCollection>(
    _ input: C, knownCount: Int /* = 0 , via extension */
  ) -> ParseResult<EncodedScalar, C.Index>
  where C.Iterator.Element == EncodedScalar.Iterator.Element
}

/// Parsing multiple unicode scalar values
extension UnicodeEncoding {
  @discardableResult
  public static func parseForward<C: Collection>(
    _ input: C,
    repairingIllFormedSequences makeRepairs: Bool = true,
    into output: (EncodedScalar) throws->Void
  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)

  @discardableResult
  public static func parseReverse<C: BidirectionalCollection>(
    _ input: C,
    repairingIllFormedSequences makeRepairs: Bool = true,
    into output: (EncodedScalar) throws->Void
  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
  where C.SubSequence : BidirectionalCollection,
        C.SubSequence.SubSequence == C.SubSequence,
        C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
}

UnicodeCodec will be updated to refine UnicodeEncoding, and all existing
codecs will conform to it.

Note, depending on whether this change lands before or after some of the
generics features, generic where clauses may need to be added temporarily.

Source compatibility

Adding collection conformance to String should not materially impact
source stability as it is purely additive: Swift 3’s String interface
currently fulfills all of the requirements for a bidirectional range
replaceable collection.

Altering String’s slicing operations to return a different type is source
breaking. The following mitigating steps are proposed:

   - Add a deprecated subscript operator that will run in Swift 3
   compatibility mode and which will return a String not a Substring.
   - Add deprecated versions of all current slicing methods to similarly
   return a String.

i.e.:

extension String {
  @available(swift, obsoleted: 4)
  subscript(bounds: Range<Index>) -> String {
    return String(characters[bounds])
  }

  @available(swift, obsoleted: 4)
  subscript(bounds: ClosedRange<Index>) -> String {
    return String(characters[bounds])
  }
}

In a review of 77 popular Swift projects found on GitHub, these changes
resolved any build issues in the 12 projects that assumed an explicit
String type returned from slicing operations.

Due to the change in internal implementation, these compatibility
operations will be *O(n)* rather than *O(1)*. This is not expected to be
a major concern, based on experiences from a similar change made to Java,
but projects will be able to work around performance issues without
upgrading to Swift 4 by explicitly typing slices as Substring, which will
call the Swift 4 variant, and which will be available but not invoked by
default in Swift 3 mode.

The C string interoperability methods outside the ones described in the
detailed design will remain in Swift 3 mode, be deprecated in Swift 4 mode,
and be removed in a subsequent release. UnicodeCodec will be similarly
deprecated.

Effect on ABI stability

As a fundamental currency type for Swift, it is essential that the String type
(and its associated subsequence) is in a good long-term state before being
locked down when Swift declares ABI stability. Shrinking the size of
String to be 64 bits is an important part of this.

Effect on API resilience

Decisions about the API resilience of the String type are still to be
determined, but are not adversely affected by this proposal.

Alternatives considered

For a more in-depth discussion of some of the trade-offs in string design,
see the manifesto and associated evolution thread
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170116/thread.html#30497>.

This proposal does not yet introduce an implicit conversion from Substring
to String. The decision on whether to add this will be deferred pending
feedback on the initial implementation. The intention is to make a preview
toolchain available for feedback, including on whether this implicit
conversion is necessary, prior to the release of Swift 4.

Several of the types related to String, such as the encodings, would
ideally reside inside a namespace rather than live at the top level of the
standard library. The best namespace for this is probably Unicode, but
this is also the name of the protocol. At some point if we gain the ability
to nest enums and types inside protocols, they should be moved there.
Putting them inside String or some other enum namespace is probably not
worthwhile in the mean-time.
_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

--
Joshua Alvarado
alvaradojoshua0@gmail.com


(Félix Cloutier) #10

I don't have many non-nitpick issues that I care much about; I'm in favor of this.

My only request: it's currently painful to create a String from a fixed-size C array. For instance, if I have a pointer to a `struct foo { char name[16]; }` in Swift where the last character doesn't have to be a NUL, it's hard to create a String from it. Real-world examples of this are Mach-O LC_SEGMENT and LC_SEGMENT_64 commands.

The generally-accepted wisdom <http://stackoverflow.com/a/27456220/251153> is that you take a pointer to the CChar tuple that represents the fixed-size array, but this still requires the string to be NUL-terminated. What do we think of an additional init(cString:) overload that takes an UnsafeBufferPointer and reads up to the first NUL or the end of the buffer, whichever comes first?
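
A rough sketch of what such an overload could look like if written as an extension today. The `cStringUpToNulOrEnd:` label is hypothetical and not part of the pitch, and the sample arrays stand in for a fixed-size `char name[N]` field.

```swift
extension String {
    /// Hypothetical initializer: decodes UTF-8 from `buffer` up to the first
    /// NUL or the end of the buffer, whichever comes first.
    init(cStringUpToNulOrEnd buffer: UnsafeBufferPointer<CChar>) {
        let end = buffer.firstIndex(of: 0) ?? buffer.endIndex
        // Reinterpret the C chars as UTF-8 code units.
        let codeUnits = buffer[..<end].map { UInt8(truncatingIfNeeded: $0) }
        // String(decoding:as:) repairs ill-formed sequences with U+FFFD.
        self = String(decoding: codeUnits, as: UTF8.self)
    }
}

// A full field with no terminator, and a shorter NUL-padded one.
let full: [CChar] = [83, 69, 71, 49]      // "SEG1"
let padded: [CChar] = [83, 69, 0, 0]      // "SE" followed by padding

let a = full.withUnsafeBufferPointer { String(cStringUpToNulOrEnd: $0) }
let b = padded.withUnsafeBufferPointer { String(cStringUpToNulOrEnd: $0) }
print(a, b)   // SEG1 SE
```

Because the buffer carries its own length, the initializer never reads past the end of the field even when no NUL is present.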

···

On 30 March 2017 at 02:48, Brent Royal-Gordon via swift-evolution <swift-evolution@swift.org> wrote:

On Mar 29, 2017, at 5:32 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

Hi Swift Evolution,

Below is a pitch for the first part of the String revision. This covers a number of changes that would allow the basic internals to be overhauled.

Online version here: https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md

Really great stuff, guys. Thanks for your work on this!

In order to be able to write extensions across both String and Substring, a new Unicode protocol to which the two types will conform will be introduced. For the purposes of this proposal, Unicode will be defined as a protocol to be used whenever you would previously extend String. It should be possible to substitute extension Unicode { ... } in Swift 4 wherever extension String { ... } was written in Swift 3, with one exception: any passing of self into an API that takes a concrete String will need to be rewritten as String(self). If Self is a String then this should effectively optimize to a no-op, whereas if Self is a Substring then this will force a copy, helping to avoid the “memory leak” problems described above.

I continue to feel that `Unicode` is the wrong name for this protocol, essentially because it sounds like a protocol for, say, a version of Unicode or some kind of encoding machinery instead of a Unicode string. I won't rehash that argument since I made it already in the manifesto thread, but I would like to make a couple new suggestions in this area.

Later on, you note that it would be nice to namespace many of these types:

Several of the types related to String, such as the encodings, would ideally reside inside a namespace rather than live at the top level of the standard library. The best namespace for this is probably Unicode, but this is also the name of the protocol. At some point if we gain the ability to nest enums and types inside protocols, they should be moved there. Putting them inside String or some other enum namespace is probably not worthwhile in the mean-time.

Perhaps we should use an empty enum to create a `Unicode` namespace and then nest the protocol within it via typealias. If we do that, we can consider names like `Unicode.Collection` or even `Unicode.String` which would shadow existing types if they were top-level.

If not, then given this:

The exact nature of the protocol – such as which methods should be protocol requirements vs which can be implemented as protocol extensions, are considered implementation details and so not covered in this proposal.

We may simply want to wait to choose a name. As the protocol develops, we may discover a theme in its requirements which would suggest a good name. For instance, we may realize that the core of what the protocol abstracts is grouping code units into characters, which might suggest a name like `Characters`, or `Unicode.Characters`, or `CharacterCollection`, or what-have-you.

(By the way, I hope that the eventual protocol requirements will be put through the review process, if only as an amendment, once they're determined.)

Unicode will conform to BidirectionalCollection. RangeReplaceableCollection conformance will be added directly onto the String and Substring types, as it is possible future Unicode-conforming types might not be range-replaceable (e.g. an immutable type that wraps a const char *).

I'm a little worried about this because it seems to imply that the protocol cannot include any mutation operations that aren't in `RangeReplaceableCollection`. For instance, it won't be possible to include an in-place `applyTransform` method in the protocol. Do you anticipate that being an issue? Might it be a good idea to define a parallel `Mutable` or `RangeReplaceable` protocol?

The C string interop methods will be updated to those described here: a single withCString operation and two init(cString:) constructors, one for UTF8 and one for arbitrary encodings.

Sorry if I'm repeating something that was already discussed, but is there a reason you don't include a `withCString` variant for arbitrary encodings? It seems like an odd asymmetry.

The standard library currently lacks a Latin1 codec, so an enum Latin1: UnicodeEncoding type will be added.

Nice. I wrote one of those once; I'll enjoy deleting it.

A new protocol, UnicodeEncoding, will be added to replace the current UnicodeCodec protocol:

public enum UnicodeParseResult<T, Index> {

Either `T` should be given a more specific name, or the enum should be given a less specific one, becoming `ParseResult` and being oriented towards incremental parsing of anything from any kind of collection.

/// Indicates valid input was recognized.
///
/// `resumptionPoint` is the end of the parsed region
case valid(T, resumptionPoint: Index) // FIXME: should these be reordered?

No, I think this is the right order. The thing that's valid is the code point.

/// Indicates invalid input was recognized.
///
/// `resumptionPoint` is the next position at which to continue parsing after
/// the invalid input is repaired.
case error(resumptionPoint: Index)

I know this is abbreviated documentation, but I hope the full version includes a good usage example demonstrating, among other things, how to detect partial characters and defer processing of them instead of rejecting them as erroneous.

/// An encoding for text with UnicodeScalar as a common currency type
public protocol UnicodeEncoding {
/// The maximum number of code units in an encoded unicode scalar value
static var maxLengthOfEncodedScalar: Int { get }

/// A type that can represent a single UnicodeScalar as it is encoded in this
/// encoding.
associatedtype EncodedScalar : EncodedScalarProtocol

There's an `EncodedScalarProtocol`-shaped hole in this proposal. What does it do? What are its semantics? How does `EncodedScalar` relate to the old `CodeUnit`?

@discardableResult
public static func parseForward<C: Collection>(
   _ input: C,
   repairingIllFormedSequences makeRepairs: Bool = true,
   into output: (EncodedScalar) throws->Void
) rethrows -> (remainder: C.SubSequence, errorCount: Int)

@discardableResult
public static func parseReverse<C: BidirectionalCollection>(
   _ input: C,
   repairingIllFormedSequences makeRepairs: Bool = true,
   into output: (EncodedScalar) throws->Void
) rethrows -> (remainder: C.SubSequence, errorCount: Int)
where C.SubSequence : BidirectionalCollection,
       C.SubSequence.SubSequence == C.SubSequence,
       C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
}

Are there constraints missing on `parseForward`?

What do these do if `makeRepairs` is false? Would it be clearer if we made an enum that described the behaviors and changed the label to something like `ifIllFormed:`?

Due to the change in internal implementation, this means that these operations will be O(n) rather than O(1). This is not expected to be a major concern, based on experiences from a similar change made to Java, but projects will be able to work around performance issues without upgrading to Swift 4 by explicitly typing slices as Substring, which will call the Swift 4 variant, and which will be available but not invoked by default in Swift 3 mode.

Will there be a way to make this also work with a real Swift 3 compiler? For instance, can you define `typealias Substring = String` in such a way that real Swift 3 will parse and use it, but Swift 4 in Swift 3 mode will ignore it?

This proposal does not yet introduce an implicit conversion from Substring to String. The decision on whether to add this will be deferred pending feedback on the initial implementation. The intention is to make a preview toolchain available for feedback, including on whether this implicit conversion is necessary, prior to the release of Swift 4.

This is a sensible approach.

Thank you for developing this into a full proposal. I discussed the plans for Swift 4 with a local group of programmers recently, and everyone was pleased to hear that `String` would get an overhaul, that the `characters` view would be integrated into the string, etc. We even talked a little about `Substring` and people thought it was a good idea. This proposal is shaping up to impact a lot of people, but in a good way!

--
Brent Royal-Gordon
Architechies


(Ben Cohen) #11

Hi Brent,

Thanks for the notes. Replies inline.

Hi Swift Evolution,

Below is a pitch for the first part of the String revision. This covers a number of changes that would allow the basic internals to be overhauled.

Online version here: https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md

Really great stuff, guys. Thanks for your work on this!

In order to be able to write extensions across both String and Substring, a new Unicode protocol to which the two types will conform will be introduced. For the purposes of this proposal, Unicode will be defined as a protocol to be used whenever you would previously extend String. It should be possible to substitute extension Unicode { ... } in Swift 4 wherever extension String { ... } was written in Swift 3, with one exception: any passing of self into an API that takes a concrete String will need to be rewritten as String(self). If Self is a String then this should effectively optimize to a no-op, whereas if Self is a Substring then this will force a copy, helping to avoid the “memory leak” problems described above.

I continue to feel that `Unicode` is the wrong name for this protocol, essentially because it sounds like a protocol for, say, a version of Unicode or some kind of encoding machinery instead of a Unicode string. I won't rehash that argument since I made it already in the manifesto thread, but I would like to make a couple new suggestions in this area.

Later on, you note that it would be nice to namespace many of these types:

Several of the types related to String, such as the encodings, would ideally reside inside a namespace rather than live at the top level of the standard library. The best namespace for this is probably Unicode, but this is also the name of the protocol. At some point if we gain the ability to nest enums and types inside protocols, they should be moved there. Putting them inside String or some other enum namespace is probably not worthwhile in the mean-time.

Perhaps we should use an empty enum to create a `Unicode` namespace and then nest the protocol within it via typealias. If we do that, we can consider names like `Unicode.Collection` or even `Unicode.String` which would shadow existing types if they were top-level.

We’re a bit on the fence about whether Unicode or StringProtocol is the better name.

The big win for Unicode is that it is short. We want to encourage people to write their extensions on this protocol. We want people who previously extended String to feel very comfortable extending Unicode. It also helps emphasize how important the Unicode-ness of Swift.String is. I like the idea of Unicode.Collection, but it is a little intimidating, and making it even a tiny bit intimidating worries me from an adoption perspective.

If not, then given this:

The exact nature of the protocol – such as which methods should be protocol requirements vs which can be implemented as protocol extensions, are considered implementation details and so not covered in this proposal.

We may simply want to wait to choose a name. As the protocol develops, we may discover a theme in its requirements which would suggest a good name. For instance, we may realize that the core of what the protocol abstracts is grouping code units into characters, which might suggest a name like `Characters`, or `Unicode.Characters`, or `CharacterCollection`, or what-have-you.

(By the way, I hope that the eventual protocol requirements will be put through the review process, if only as an amendment, once they're determined.)

Definitely. We just want to minimize churn on the group so that as many people as possible can follow the discussion of the broader principles. Once it’s firmed up and we’ve had implementation/usability/performance feedback, we’ll be back.

Unicode will conform to BidirectionalCollection. RangeReplaceableCollection conformance will be added directly onto the String and Substring types, as it is possible future Unicode-conforming types might not be range-replaceable (e.g. an immutable type that wraps a const char *).

I'm a little worried about this because it seems to imply that the protocol cannot include any mutation operations that aren't in `RangeReplaceableCollection`. For instance, it won't be possible to include an in-place `applyTransform` method in the protocol. Do you anticipate that being an issue? Might it be a good idea to define a parallel `Mutable` or `RangeReplaceable` protocol?

You can always assign to self, then provide more efficient implementations where Self is a RangeReplaceableCollection. We do this elsewhere in the std lib with collections e.g. https://github.com/apple/swift/blob/master/stdlib/public/core/Collection.swift#L1277.
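
A sketch of that pattern under stated assumptions: `TextLike`, `FrozenText`, `init(characters:)` and `removeSpaces()` are all hypothetical names, standing in for the pitched protocol, an immutable conforming type, and an in-place operation respectively.

```swift
// Hypothetical stand-in for the pitched protocol.
protocol TextLike: BidirectionalCollection where Element == Character {
    init(characters: [Character])
}

extension TextLike {
    // Default for every conforming type: build a new value, assign to self.
    mutating func removeSpaces() {
        self = Self(characters: filter { $0 != " " })
    }
}

extension TextLike where Self: RangeReplaceableCollection {
    // More efficient path when the type can be edited in place.
    mutating func removeSpaces() {
        removeAll(where: { $0 == " " })
    }
}

// String is range-replaceable, so it picks up the in-place version.
extension String: TextLike {
    init(characters: [Character]) { self.init(characters) }
}

// A minimal immutable type that only gets the assign-to-self default.
struct FrozenText: TextLike {
    private let storage: [Character]
    init(characters: [Character]) { storage = characters }
    var startIndex: Int { return storage.startIndex }
    var endIndex: Int { return storage.endIndex }
    subscript(position: Int) -> Character { return storage[position] }
    func index(after i: Int) -> Int { return i + 1 }
    func index(before i: Int) -> Int { return i - 1 }
}

var s = "a b c"
s.removeSpaces()
var f = FrozenText(characters: Array("a b c"))
f.removeSpaces()
print(s, String(f))   // abc abc
```

Callers see one `removeSpaces()` either way; the constrained extension is simply preferred whenever the concrete type is also range-replaceable.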

Proliferating protocol combinations is problematic (looking at you, BidirectionalMutableRandomAccessSlice).

The C string interop methods will be updated to those described here: a single withCString operation and two init(cString:) constructors, one for UTF8 and one for arbitrary encodings.

Sorry if I'm repeating something that was already discussed, but is there a reason you don't include a `withCString` variant for arbitrary encodings? It seems like an odd asymmetry.

Hmm. Is this a common use-case people have? Symmetry for the sake of it doesn’t seem enough. If uncommon, you can do it via an Array that you nul-terminate manually.

The standard library currently lacks a Latin1 codec, so an enum Latin1: UnicodeEncoding type will be added.

Nice. I wrote one of those once; I'll enjoy deleting it.

A new protocol, UnicodeEncoding, will be added to replace the current UnicodeCodec protocol:

public enum UnicodeParseResult<T, Index> {

Either `T` should be given a more specific name, or the enum should be given a less specific one, becoming `ParseResult` and being oriented towards incremental parsing of anything from any kind of collection.

Yeah, it’s tempting to make ParseResult general, and the only reason we held off is that we don’t want the work of making sure it’s generally useful to become a distraction.

As a rule, T is as good as any other name when another name (say, “Value”) would be vacuous or tortured. Even with it being specific to Unicode, there isn’t really a good other name for it.

(For an example elsewhere in the stdlib, we use T for min<T: Comparable>(_ x: T, _ y: T) -> T. Trying to force in Value or MyComparable or SomeComparableThing wouldn’t be helpful.)

/// Indicates valid input was recognized.
///
/// `resumptionPoint` is the end of the parsed region
case valid(T, resumptionPoint: Index) // FIXME: should these be reordered?

No, I think this is the right order. The thing that's valid is the code point.

Oops, I meant to delete that FIXME for the purposes of the proposal!

/// Indicates invalid input was recognized.
///
/// `resumptionPoint` is the next position at which to continue parsing after
/// the invalid input is repaired.
case error(resumptionPoint: Index)

I know this is abbreviated documentation, but I hope the full version includes a good usage example demonstrating, among other things, how to detect partial characters and defer processing of them instead of rejecting them as erroneous.

This documentation should definitely happen as part of the fuller implementation, yes.

/// An encoding for text with UnicodeScalar as a common currency type
public protocol UnicodeEncoding {
/// The maximum number of code units in an encoded unicode scalar value
static var maxLengthOfEncodedScalar: Int { get }

/// A type that can represent a single UnicodeScalar as it is encoded in this
/// encoding.
associatedtype EncodedScalar : EncodedScalarProtocol

There's an `EncodedScalarProtocol`-shaped hole in this proposal. What does it do? What are its semantics? How does `EncodedScalar` relate to the old `CodeUnit`?

Ah, yes. Here it is:

public protocol EncodedScalarProtocol : RandomAccessCollection {
  init?(_ scalarValue: UnicodeScalar)
  var utf8: UTF8.EncodedScalar { get }
  var utf16: UTF16.EncodedScalar { get }
  var utf32: UTF32.EncodedScalar { get }
}

This is only really here as a (possibly premature) optimization – a fast path to go from very common encodings of scalars to another without having to turn them into a scalar and back. It doesn’t relate to much else.

@discardableResult
public static func parseForward<C: Collection>(
   _ input: C,
   repairingIllFormedSequences makeRepairs: Bool = true,
   into output: (EncodedScalar) throws->Void
) rethrows -> (remainder: C.SubSequence, errorCount: Int)

@discardableResult
public static func parseReverse<C: BidirectionalCollection>(
   _ input: C,
   repairingIllFormedSequences makeRepairs: Bool = true,
   into output: (EncodedScalar) throws->Void
) rethrows -> (remainder: C.SubSequence, errorCount: Int)
where C.SubSequence : BidirectionalCollection,
       C.SubSequence.SubSequence == C.SubSequence,
       C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
}

Are there constraints missing on `parseForward`?

Yep – see the note that appears a little later. They’re really implementation details – so not something to capture in the proposal – which may or may not be needed depending on whether this lands before or after the generics features that make them redundant.

What do these do if `makeRepairs` is false? Would it be clearer if we made an enum that described the behaviors and changed the label to something like `ifIllFormed:`?

The Unicode standard specifies values to substitute when making repairs.
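
For comparison, the existing `transcode(_:from:to:stoppingOnError:into:)` function in the standard library already exposes both behaviours. This sketch uses it rather than the pitched `parseForward`, whose final semantics are not specified here; the input bytes are made-up sample data.

```swift
// "A", an invalid byte (0xFF can never appear in UTF-8), then "B".
let input: [UInt8] = [0x41, 0xFF, 0x42]

// Repairing: the ill-formed byte becomes U+FFFD and decoding continues.
var repaired: [UInt16] = []
let sawError = transcode(
    input.makeIterator(), from: UTF8.self, to: UTF16.self,
    stoppingOnError: false
) { repaired.append($0) }

// Non-repairing: transcoding stops at the first ill-formed sequence.
var strict: [UInt16] = []
let stopped = transcode(
    input.makeIterator(), from: UTF8.self, to: UTF16.self,
    stoppingOnError: true
) { strict.append($0) }

print(repaired, sawError)   // [65, 65533, 66] true
print(strict, stopped)      // [65] true
```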

Due to the change in internal implementation, this means that these operations will be O(n) rather than O(1). This is not expected to be a major concern, based on experiences from a similar change made to Java, but projects will be able to work around performance issues without upgrading to Swift 4 by explicitly typing slices as Substring, which will call the Swift 4 variant, and which will be available but not invoked by default in Swift 3 mode.

Will there be a way to make this also work with a real Swift 3 compiler? For instance, can you define `typealias Substring = String` in such a way that real Swift 3 will parse and use it, but Swift 4 in Swift 3 mode will ignore it?

Are you talking about this as a way for people to change their code, while still being able to compile their code with the old compiler? Yes, that might be a good strategy, will think about that.

This proposal does not yet introduce an implicit conversion from Substring to String. The decision on whether to add this will be deferred pending feedback on the initial implementation. The intention is to make a preview toolchain available for feedback, including on whether this implicit conversion is necessary, prior to the release of Swift 4.

This is a sensible approach.

Thank you for developing this into a full proposal. I discussed the plans for Swift 4 with a local group of programmers recently, and everyone was pleased to hear that `String` would get an overhaul, that the `characters` view would be integrated into the string, etc. We even talked a little about `Substring` and people thought it was a good idea. This proposal is shaping up to impact a lot of people, but in a good way!

This is good to hear, including the last part, thanks.

···

On Mar 30, 2017, at 2:48 AM, Brent Royal-Gordon <brent@architechies.com> wrote:

On Mar 29, 2017, at 5:32 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

--
Brent Royal-Gordon
Architechies


(Joshua Alvarado) #12

...my vote would be to lowercase Subsequence. We can typealias
SubSequence = Subsequence to aid migration

+1 didn't think that was an option. A good solution would be to have them
either camel case (SubString, SubSequence) or just capitalized (Substring,
Subsequence); either would be nice as long as they were matching.

···

On Thu, Mar 30, 2017 at 9:38 AM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

On Mar 29, 2017, at 6:59 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

This looks great. The restored conformances to *Collection will be huge.

Is this to be the first of several or the only major part of the manifesto
to be delivered in Swift 4?

First of several. This lays the ground work for the changes to the
underlying implementation. Other changes will mostly be additive on top.

Nits on naming: are we calling it Substring or SubString (à la
SubSequence)?

This is venturing into subjective territory, so these are just my feelings
rather than something definitive (Dave may differ) but:

It should definitely be Substring. My rule of thumb: if you might
hyphenate it, you can capitalize it. I don’t think anyone spells it
"sub-string". OTOH one *might* write "sub-sequence". Generally hyphens
disappear in english as things come into common usage i.e. it used to be
e-mail but now it’s mostly just email. Substring is enough of a term of
art in programming that this has happened. Admittedly, Subsequence is a
term of art too – unfortunately one that has a different meaning to ours
("a sequence that can be derived from another sequence by deleting some
elements without changing the order of the remaining elements" e.g. <A,C,E>
is a Subsequence of <A,B,C,D,E> – see
https://en.wikipedia.org/wiki/Subsequence). Even worse, the mathematical
term for what we are calling a subsequence is a Substring!

If we were to change anything, my vote would be to lowercase Subsequence. We
can typealias SubSequence = Subsequence to aid migration, with a slow burn
on deprecating it since it’ll be quite a footling deprecation. I don’t know
if it’s worth it though – the main use of “SubSequence” is currently in
those pesky where clauses you have to put on all your Collection extensions
if you want to use slicing, and many of these will be eliminated once we
have the ability to put where clauses on associated types.

and shouldn't it be UnicodeParsedResult rather than UnicodeParseResult?

I think Parse. As in, this is the result of a parse, not these are the
parsed results (though it does contain parsed results in some cases, but
not all).

--
Joshua Alvarado
alvaradojoshua0@gmail.com


(Brent Royal-Gordon) #13

I was going to make the same argument, but you beat me to it.

"String" and "Substring" are both terms of art. (That's why there's no adjective form of "string", which makes naming the protocol difficult.) And they're probably the most widely-used terms of art in programming. "Substring" is inconsistent with other parts of the language, but for a good reason.

Keep it.

···

On Mar 30, 2017, at 8:38 AM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

It should definitely be Substring. My rule of thumb: if you might hyphenate it, you can capitalize it.

--
Brent Royal-Gordon
Sent from my iPhone


(Brent Royal-Gordon) #14

The big win for Unicode is it is short. We want to encourage people to write their extensions on this protocol. We want people who previously extended String to feel very comfortable extending Unicode. It also helps emphasize how important the Unicode-ness of Swift.String is. I like the idea of Unicode.Collection, but it is a little intimidating and making it even a tiny bit intimidating is worrying to me from an adoption perspective.

Yeah, I understand why "Collection" might be intimidating. But I think "Unicode" would be too—it's opaque enough that people wouldn't be entirely sure whether they were extending the right thing.

I did a quick run-through of different languages and the protocols/interfaces/whatever their string types conform to, but most don't seem to have anything that abstracts string types. The only similar things I could find were `CharSequence` in Java, `StringLike` in Scala...and `Stringy` in Perl 6. And I'm sure you thought you were joking!

Honestly, I'd recommend just going with `StringProtocol` unless you can come up with an adjective form you like (`Stringlike`? `Textual`?). It's a bit clumsy, but it's crystal clear. Stupid name, but you'll never forget it.

I'm a little worried about this because it seems to imply that the protocol cannot include any mutation operations that aren't in `RangeReplaceableCollection`. For instance, it won't be possible to include an in-place `applyTransform` method in the protocol. Do you anticipate that being an issue? Might it be a good idea to define a parallel `Mutable` or `RangeReplaceable` protocol?

You can always assign to self. Then provide more efficient implementations where Self is a RangeReplaceableCollection. We do this elsewhere in the std lib with collections e.g. https://github.com/apple/swift/blob/master/stdlib/public/core/Collection.swift#L1277.
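As a rough illustration of the pattern described here, the sketch below gives a protocol extension a fallback that builds a new value and assigns it to self, plus a more specific overload for conformers that are also RangeReplaceableCollection. The protocol `TextLike`, its `init(characters:)` requirement, the `FixedText` type, and the `removeVowels()` method are hypothetical names invented for this example; none of them come from the proposal or the standard library.

```swift
// Hypothetical stand-in for the proposed string protocol (not a stdlib type).
protocol TextLike: BidirectionalCollection where Element == Character {
    // Assumed requirement so the fallback can build a fresh value.
    init(characters: [Character])
}

let vowels: Set<Character> = ["a", "e", "i", "o", "u"]

extension TextLike {
    // Fallback available to every conformer: build a new value, assign to self.
    mutating func removeVowels() {
        self = Self(characters: filter { !vowels.contains($0) })
    }
}

extension TextLike where Self: RangeReplaceableCollection {
    // More specific overload: mutate the existing storage in place.
    mutating func removeVowels() {
        removeAll(where: { vowels.contains($0) })
    }
}

extension String: TextLike {
    init(characters: [Character]) { self.init(characters) }
}

// A conformer with immutable storage that is not range-replaceable.
struct FixedText: TextLike {
    private let storage: [Character]
    init(characters: [Character]) { storage = characters }
    var startIndex: Int { return storage.startIndex }
    var endIndex: Int { return storage.endIndex }
    subscript(position: Int) -> Character { return storage[position] }
    func index(after i: Int) -> Int { return i + 1 }
    func index(before i: Int) -> Int { return i - 1 }
}

var mutable = "swift evolution"
mutable.removeVowels()   // resolves to the in-place overload

var fixed = FixedText(characters: Array("unicode"))
fixed.removeVowels()     // resolves to the assign-to-self fallback
```

Because `removeVowels()` is not a protocol requirement in this sketch, the choice between the two bodies is made statically from the concrete type at the call site.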

Proliferating protocol combinations is problematic (looking at you, BidirectionalMutableRandomAccessSlice).

Nobody likes proliferation, but in this case it'd be because there genuinely were additional semantics that were only available on mutable strings.

(Once upon a time, I think I requested the ability to write `func index(of elem: Iterator.Element) -> Index? where Iterator.Element: Equatable`. Could such a feature be used for this? `func apply(_ transform: StringTransform, reverse: Bool) where Self: RangeReplaceableCollection`?)

The C string interop methods will be updated to those described here: a single withCString operation and two init(cString:) constructors, one for UTF8 and one for arbitrary encodings.

Sorry if I'm repeating something that was already discussed, but is there a reason you don't include a `withCString` variant for arbitrary encodings? It seems like an odd asymmetry.

Hmm. Is this a common use-case people have? Symmetry for the sake of it doesn’t seem enough. If uncommon, you can do it via an Array that you nul-terminate manually.
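The manual workaround mentioned here might look something like the following sketch for UTF-16. The helper name `withNulTerminatedUTF16` is an assumption made up for this example and is not part of the proposal; it only relies on `String.utf16` and `Array.withUnsafeBufferPointer`.

```swift
// Hypothetical helper: copy the UTF-16 code units into an array, append the
// terminator by hand, and lend the buffer to `body` for the closure's duration.
func withNulTerminatedUTF16<Result>(
    _ text: String,
    _ body: (UnsafePointer<UInt16>) throws -> Result
) rethrows -> Result {
    var units = Array(text.utf16)
    units.append(0)  // manual nul terminator
    return try units.withUnsafeBufferPointer { buffer in
        // The array holds at least the terminator, so baseAddress is non-nil.
        try body(buffer.baseAddress!)
    }
}

// Usage: count code units up to the terminator, as a C API would.
let unitCount = withNulTerminatedUTF16("h\u{E9}llo") { pointer -> Int in
    var count = 0
    while pointer[count] != 0 { count += 1 }
    return count
}
```

The pointer must not escape the closure, which is the same restriction `withCString` imposes.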

Is `init(cString:encoding:)` a common use case? If it is, I'm not sure why the opposite wouldn't be.

Yeah, it’s tempting to make ParseResult general, and the only reason we held off is that we don’t want the work of making sure it’s generally useful to become a distraction.

Understandable.

I wonder if some part of the parsing algorithm could somehow be generalized so it was suitable for many purposes and then put on `Collection`, with the `UnicodeEncoding` then being passed as a parameter to it. If so, that would justify making `ParseResult` a top-level type.

Ah, yes. Here it is:

public protocol EncodedScalarProtocol : RandomAccessCollection {
  init?(_ scalarValue: UnicodeScalar)
  var utf8: UTF8.EncodedScalar { get }
  var utf16: UTF16.EncodedScalar { get }
  var utf32: UTF32.EncodedScalar { get }
}

What is the `Element` type expected to be here?

I think what's missing is a holistic overview of the encoding system. So, please help me write this function:

  func unicodeScalars<Encoding: UnicodeEncoding>(in data: Data, using encoding: Encoding.Type) -> [UnicodeScalar] {
    var scalars: [UnicodeScalar] = []
    
    data.withUnsafeBytes { (bytes: UnsafePointer<$ParseInputElement>) in
      let buffer = UnsafeBufferPointer(start: bytes, count: data.count / MemoryLayout<$ParseInputElement>.size)
      encoding.parseForward(buffer) { encodedScalar in
        let unicodeScalar: UnicodeScalar = $doSomething(encodedScalar)
        scalars.append(unicodeScalar)
      }
    }
    
    return scalars
  }

What type would I put for $ParseInputElement? What function or initializer do I call for $doSomething?

@discardableResult
public static func parseForward<C: Collection>(
  _ input: C,
  repairingIllFormedSequences makeRepairs: Bool = true,
  into output: (EncodedScalar) throws->Void
) rethrows -> (remainder: C.SubSequence, errorCount: Int)

Are there constraints missing on `parseForward`?

Yep – see the note that appears a little later. They’re really implementation details – so not something to capture in the proposal – which may or may not be needed depending on whether this lands before or after the generics features that make them redundant.

No, I mean because this says nothing about `C`'s element type. Presumably you can't parse a bunch of `UIView`s into Unicode scalars, so there must be some kind of constraint on the collection's elements. What is it?

...oh, I notice that `parseScalarForward(_:knownCount:)` has the clause `where C.Iterator.Element == EncodedScalar.Iterator.Element` attached. Should that also be attached to `parseForward(_:repairingIllFormedSequences:into:)`?

What do these do if `makeRepairs` is false? Would it be clearer if we made an enum that described the behaviors and changed the label to something like `ifIllFormed:`?

The Unicode standard specifies values to substitute when making repairs.

I'm asking what happens if you *don't* want to make repairs. Does it, say, stop immediately, returning an `errorCount` of `1` and a `remainder` that starts at the site of the error? If so, would we be better off having that parameter be something like `ifIllFormed: .stop` or `ifIllFormed: .repair`, rather than `repairingIllFormedSequences: false` or `repairingIllFormedSequences: true`?
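To make the two questions above concrete, here is a minimal sketch of what an enum-based label and an element-type constraint could look like. It is written against the existing `UnicodeCodec.decode(_:)` API rather than the proposed `parseForward`, and the names `IllFormedSequencePolicy` and `decodeScalars` are assumptions invented for illustration. It also returns only an error count, not a `remainder`.

```swift
// Hypothetical policy type modelled on the `ifIllFormed:` label suggested above.
enum IllFormedSequencePolicy {
    case stop    // give up at the first ill-formed sequence
    case repair  // substitute U+FFFD and keep going
}

// Decodes `input` with an existing UnicodeCodec. The `where` clause ties the
// collection's element type to the codec's code unit, so a collection of
// unrelated values cannot be passed in.
func decodeScalars<Codec: UnicodeCodec, C: Collection>(
    _ input: C,
    as _: Codec.Type,
    ifIllFormed policy: IllFormedSequencePolicy
) -> (scalars: [UnicodeScalar], errorCount: Int)
    where C.Element == Codec.CodeUnit
{
    var codec = Codec()
    var iterator = input.makeIterator()
    var scalars: [UnicodeScalar] = []
    var errorCount = 0

    while true {
        switch codec.decode(&iterator) {
        case .scalarValue(let scalar):
            scalars.append(scalar)
        case .emptyInput:
            return (scalars, errorCount)
        case .error:
            errorCount += 1
            if policy == .stop { return (scalars, errorCount) }
            scalars.append("\u{FFFD}")  // Unicode replacement character
        }
    }
}

let bytes: [UInt8] = [0x41, 0xFF, 0x42]  // "A", an invalid byte, "B"
let repaired = decodeScalars(bytes, as: UTF8.self, ifIllFormed: .repair)
let stopped = decodeScalars(bytes, as: UTF8.self, ifIllFormed: .stop)
```

With `.repair` the invalid byte becomes U+FFFD and decoding continues; with `.stop` the function returns what it decoded before the first error.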

Due to the change in internal implementation, this means that these operations will be O(n) rather than O(1). This is not expected to be a major concern, based on experiences from a similar change made to Java, but projects will be able to work around performance issues without upgrading to Swift 4 by explicitly typing slices as Substring, which will call the Swift 4 variant, and which will be available but not invoked by default in Swift 3 mode.

Will there be a way to make this also work with a real Swift 3 compiler? For instance, can you define `typealias Substring = String` in such a way that real Swift 3 will parse and use it, but Swift 4 in Swift 3 mode will ignore it?

Are you talking about this as a way for people to change their code, while still being able to compile their code with the old compiler? Yes, that might be a good strategy, will think about that.

Yes, that's what I'm talking about.

I guess the actual question is, does `#if swift(>=4)` come out as `true` for Swift 4 in Swift 3 mode? If not, is there some way to detect that you're using Swift 4 in Swift 3 mode? (I suppose one answer is "yes, Swift 4 in Swift 3 mode is called Swift 3.2"; I just haven't heard anyone mention anything like that yet.) In either case, if there's some way to distinguish, you could say:

  #if thisIsRealSwift3NotSwift4PretendingToBeSwift3()
  typealias Substring = String
  #endif

And then you could write the rest of your code using `Substring` and it would compile using both Swift 3 and Swift 4 toolchains, never forcing an implicit copy.
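A sketch of how that shim might be spelled, assuming (as speculated above, and not confirmed anywhere in this thread) that the Swift 4 compiler in Swift 3 mode reports itself as version 3.2. The function `leadingSlice(of:length:)` is a hypothetical example, and it uses only range subscripting and `index(_:offsetBy:)`.

```swift
// Assumption: only a real Swift 3 compiler fails this version check, so only
// it needs the alias; newer toolchains already provide a Substring type.
#if !swift(>=3.2)
typealias Substring = String
#endif

// The slice is typed as Substring, so toolchains that have the real type
// return a view without an implicit copy.
func leadingSlice(of text: String, length: Int) -> Substring {
    let end = text.index(text.startIndex, offsetBy: length)
    return text[text.startIndex..<end]
}

let word = leadingSlice(of: "hello world", length: 5)
print(String(word))  // prints "hello"
```

Callers that need long-term storage would still convert with `String(word)`, which is a cheap no-op where `Substring` is only an alias for `String`.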

···

On Mar 30, 2017, at 2:36 PM, Ben Cohen <ben_cohen@apple.com> wrote:

--
Brent Royal-Gordon
Architechies


(Xiaodi Wu) #15

This looks great. The restored conformances to *Collection will be huge.

Is this to be the first of several or the only major part of the manifesto
to be delivered in Swift 4?

First of several. This lays the groundwork for the changes to the
underlying implementation. Other changes will mostly be additive on top.

Nits on naming: are we calling it Substring or SubString (à la
SubSequence)?

This is venturing into subjective territory, so these are just my feelings
rather than something definitive (Dave may differ) but:

It should definitely be Substring. My rule of thumb: if you might
hyphenate it, you can capitalize it. I don’t think anyone spells it
"sub-string". OTOH one *might* write "sub-sequence". Generally hyphens
disappear in English as things come into common usage, e.g. it used to be
e-mail but now it’s mostly just email. Substring is enough of a term of
art in programming that this has happened. Admittedly, Subsequence is a
term of art too – unfortunately one that has a different meaning to ours
("a sequence that can be derived from another sequence by deleting some
elements without changing the order of the remaining elements" e.g. <A,C,E>
is a Subsequence of <A,B,C,D,E> – see
https://en.wikipedia.org/wiki/Subsequence). Even worse, the mathematical
term for what we are calling a subsequence is a Substring!

If we were to change anything, my vote would be to lowercase Subsequence. We
can typealias SubSequence = Subsequence to aid migration, with a slow burn
on deprecating it since it’ll be quite a footling deprecation. I don’t know
if it’s worth it though – the main use of “SubSequence” is currently in
those pesky where clauses you have to put on all your Collection extensions
if you want to use slicing, and many of these will be eliminated once we
have the ability to put where clauses on associated types.

I regret bringing this up. `Substring` is totally fine. `SubSequence` is
too. Just wanted to get some clarification that this was the proposed
spelling. I doubt it's worth a whole migration to change the capitalization
of `SubSequence`, which after all prevents the word from being read like
"consequence."

and shouldn't it be UnicodeParsedResult rather than UnicodeParseResult?

I think Parse. As in, this is the result of a parse, not these are the
parsed results (though it does contain parsed results in some cases, but
not all).

Ah, then `UnicodeParsingResult`, maybe? Something about nouning that verb
doesn't sit right. OK, done with bikeshedding.

···

On Thu, Mar 30, 2017 at 10:38 AM, Ben Cohen <ben_cohen@apple.com> wrote:

On Mar 29, 2017, at 6:59 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

On Wed, Mar 29, 2017 at 19:32 Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

Hi Swift Evolution,

Below is a pitch for the first part of the String revision. This covers a
number of changes that would allow the basic internals to be overhauled.

Online version here: https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md

String Revision: Collection Conformance, C Interop, Transcoding

   - Proposal: SE-0161
   - Authors: Ben Cohen <https://github.com/airspeedswift>, Dave Abrahams
   <http://github.com/dabrahams/>
   - Review Manager: TBD
   - Status: *Awaiting review*

Introduction

This proposal is to implement a subset of the changes from the Swift 4
String Manifesto
<https://github.com/apple/swift/blob/master/docs/StringManifesto.md>.

Specifically:

   - Make String conform to BidirectionalCollection
   - Make String conform to RangeReplaceableCollection
   - Create a Substring type for String.SubSequence
   - Create a Unicode protocol to allow for generic operations over both
   types.
   - Consolidate on a concise set of C interop methods.
   - Revise the transcoding infrastructure.

Other existing aspects of String remain unchanged for the purposes of
this proposal.

Motivation

This proposal follows up on a number of recommendations found in the
manifesto:

Collection conformance was dropped from String in Swift 2. After
reevaluation, the feeling is that the minor semantic discrepancies (mainly
with RangeReplaceableCollection) are outweighed by the significant
benefits of restoring these conformances. For more detail on the reasoning,
see here
<https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again>

While it is not a collection, the Swift 3 string does have slicing
operations. String is currently serving as its own subsequence, allowing
substrings to share storage with their “owner”. This can lead to memory
leaks when small substrings of larger strings are stored long-term (see
here
<https://github.com/apple/swift/blob/master/docs/StringManifesto.md#substrings> for
more detail on this problem). Introducing a separate type of Substring to
serve as String.SubSequence is recommended to resolve this issue, in a
similar fashion to ArraySlice.

As noted in the manifesto, support for interoperation with nul-terminated
C strings in Swift 3 is scattered and incoherent, with six ways to transform
a C string into a String and four ways to do the inverse. These APIs
should be replaced with a simpler set of methods on String.

Proposed solution

A new type, Substring, will be introduced. Similar to ArraySlice it will
be documented as only for short- to medium-term storage:

*Important*
Long-term storage of Substring instances is discouraged. A substring
holds a reference to the entire storage of a larger string, not just to the
portion it presents, even after the original string’s lifetime ends.
Long-term storage of a substring may therefore prolong the lifetime of
elements that are no longer otherwise accessible, which can appear to be
memory leakage.

Aside from minor differences, such as having a SubSequence of Self and a
larger size to describe the range of the subsequence, Substring will be
near-identical from a user perspective.

In order to be able to write extensions across both String and Substring,
a new Unicode protocol to which the two types will conform will be
introduced. For the purposes of this proposal, Unicode will be defined as
a protocol to be used whenever you would previously extend String. It
should be possible to substitute extension Unicode { ... } in Swift 4
wherever extension String { ... } was written in Swift 3, with one
exception: any passing of self into an API that takes a concrete String will
need to be rewritten as String(self). If Self is a String then this
should effectively optimize to a no-op, whereas if Self is a Substring then
this will force a copy, helping to avoid the “memory leak” problems
described above.

The exact nature of the protocol – such as which methods should be
protocol requirements vs which can be implemented as protocol extensions –
is considered an implementation detail and so is not covered in this proposal.

Unicode will conform to BidirectionalCollection.
RangeReplaceableCollection conformance will be added directly onto the
String and Substring types, as it is possible future Unicode-conforming
types might not be range-replaceable (e.g. an immutable type that wraps a const
char *).
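
One practical consequence of both types being range-replaceable is that a single generic algorithm can mutate either. The following is a sketch under the assumption of the Swift 4 conformances as shipped; `maskPrefix(_:count:)` is a hypothetical helper, not part of the proposal.

```swift
// Replaces up to `n` leading characters with asterisks. Works on any
// range-replaceable collection of Characters, including String and Substring.
func maskPrefix<C: RangeReplaceableCollection>(_ text: inout C, count n: Int)
    where C.Element == Character {
    let end = text.index(text.startIndex, offsetBy: n, limitedBy: text.endIndex)
        ?? text.endIndex
    let masked = text.distance(from: text.startIndex, to: end)
    text.replaceSubrange(text.startIndex..<end,
                         with: repeatElement("*" as Character, count: masked))
}

var password = "secret"
maskPrefix(&password, count: 3)

var tail = "hello world".dropFirst(6)  // Substring
maskPrefix(&tail, count: 2)
print(password, tail)
```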

The C string interop methods will be updated to those described here
<https://github.com/apple/swift/blob/master/docs/StringManifesto.md#c-string-interop>:
a single withCString operation and two init(cString:) constructors, one
for UTF8 and one for arbitrary encodings. The primary change is to remove
“non-repairing” variants of construction from nul-terminated C strings. In
both of the construction APIs, any invalid encoding sequence detected will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per the Unicode specification. This covers the common case. The
replacement is done physically in the underlying storage and the validity
of the result is recorded in the String’s encoding such that future
accesses need not be slowed down by possible error repair separately.
Construction that is aborted when encoding errors are detected can be
accomplished using APIs on the encoding.
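
For reference, the repairing behavior can be observed with APIs that exist today. This sketch uses the current `String(cString:)` and the existing non-repairing `String.decodeCString(_:as:repairingInvalidCodeUnits:)`; it does not show the pitched replacement APIs, and the byte values are invented for the example.

```swift
// "hi" followed by 0xFF (never valid in UTF-8) and the NUL terminator.
let raw: [UInt8] = [0x68, 0x69, 0xFF, 0x00]

// Repairing construction: the invalid byte becomes U+FFFD.
let repaired = raw.withUnsafeBufferPointer { buffer in
    String(cString: buffer.baseAddress!)
}

// Non-repairing construction with the existing API: yields nil on error.
let strict = raw.withUnsafeBufferPointer { buffer in
    String.decodeCString(buffer.baseAddress, as: UTF8.self,
                         repairingInvalidCodeUnits: false)
}
print(repaired, strict == nil)
```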

The current transcoding support will be updated to improve usability and
performance. The primary changes will be:

   - to allow transcoding directly from one encoding to another without
   having to triangulate through an intermediate scalar value
   - to add the ability to transcode an input collection in reverse,
   allowing the different views on String to be made bi-directional
   - to have decoding take a collection rather than an iterator, and
   return an index of its progress into the source, allowing that method to be
   static
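
For context, this is roughly what transcoding looks like with the existing iterator-based entry point that the changes above aim to improve on. It is a sketch using the current `transcode(_:from:to:stoppingOnError:into:)` function, not the pitched API.

```swift
let utf8Bytes: [UInt8] = Array("héllo".utf8)
var utf16Units: [UInt16] = []

// The existing API consumes an iterator and reports only whether an
// error occurred, not how far into the source it progressed.
let hadError = transcode(utf8Bytes.makeIterator(),
                         from: UTF8.self,
                         to: UTF16.self,
                         stoppingOnError: true) { utf16Units.append($0) }
print(hadError, utf16Units.count)
```

Because the input is an iterator, there is no index to report progress with and no way to run the same operation in reverse, which is what the collection-based design addresses.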

The standard library currently lacks a Latin1 codec, so an enum Latin1:
UnicodeEncoding type will be added.
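
Until such a codec exists, Latin-1 decoding is typically hand-written, since each Latin-1 code unit maps directly to the Unicode scalar with the same value. A minimal sketch, where `decodeLatin1(_:)` is a hypothetical helper and not the pitched `Latin1` type:

```swift
// Decodes ISO 8859-1 (Latin-1) bytes: byte value N is scalar U+00NN.
func decodeLatin1(_ bytes: [UInt8]) -> String {
    var scalars = String.UnicodeScalarView()
    for byte in bytes {
        scalars.append(Unicode.Scalar(byte))
    }
    return String(scalars)
}

let text = decodeLatin1([0x63, 0x61, 0x66, 0xE9])
print(text)
```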
Detailed design

The following additions will be made to the standard library:

protocol Unicode: BidirectionalCollection {
  // Implementation detail as described above
}
extension String: Unicode, RangeReplaceableCollection {
  typealias SubSequence = Substring
}
struct Substring: Unicode, RangeReplaceableCollection {
  typealias SubSequence = Substring
  // near-identical API surface area to String
}

The subscript operations on String will be amended to return Substring:

struct String {
  subscript(bounds: Range<String.Index>) -> Substring { get }
  subscript(bounds: ClosedRange<String.Index>) -> Substring { get }
}

Note that properties or methods that due to their nature create new String storage
(such as lowercased()) will *not* change.
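
A short sketch of the distinction, assuming the behavior as it shipped in Swift 4 (and `firstIndex(of:)` from Swift 4.2 or later):

```swift
let greeting = "Hello, World"
let comma = greeting.firstIndex(of: ",")!

// Range subscripting returns a Substring sharing storage with `greeting`.
let head = greeting[greeting.startIndex..<comma]

// lowercased() must build new storage, so it returns a String
// even when called on a Substring.
let lowered = head.lowercased()
print(type(of: head), type(of: lowered), lowered)
```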

C string interop will be consolidated on the following methods:

extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  /// bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  /// the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  /// should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}

Additionally, the current ability to pass a Swift String into C methods
that take a C string will remain as-is.
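
A minimal round trip through the UTF-8 variants, using the forms of `withCString` and `init(cString:)` that already exist today. The manual length loop stands in for a C function such as `strlen` so that the sketch needs no imports.

```swift
let original = "héllo"

// Count UTF-8 bytes up to the NUL terminator, as a C function would.
let byteCount = original.withCString { pointer -> Int in
    var length = 0
    while pointer[length] != 0 { length += 1 }
    return length
}

// Rebuild a String from the temporary C string.
let roundTripped = original.withCString { String(cString: $0) }
print(byteCount, roundTripped)
```

The pointer passed to the closure is only valid for the duration of the call, so any result must be copied out, as `String(cString:)` does here.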

A new protocol, UnicodeEncoding, will be added to replace the current
UnicodeCodec protocol:

public enum UnicodeParseResult<T, Index> {
  /// Indicates valid input was recognized.
  ///
  /// `resumptionPoint` is the end of the parsed region
  case valid(T, resumptionPoint: Index) // FIXME: should these be reordered?

  /// Indicates invalid input was recognized.
  ///
  /// `resumptionPoint` is the next position at which to continue parsing after
  /// the invalid input is repaired.
  case error(resumptionPoint: Index)

  /// Indicates that there was no more input to consume.
  case emptyInput

  /// If any input was consumed, the point from which to continue parsing.
  var resumptionPoint: Index? {
    switch self {
    case .valid(_,let r): return r
    case .error(let r): return r
    case .emptyInput: return nil
    }
  }
}

/// An encoding for text with UnicodeScalar as a common currency type
public protocol UnicodeEncoding {
  /// The maximum number of code units in an encoded unicode scalar value
  static var maxLengthOfEncodedScalar: Int { get }

  /// A type that can represent a single UnicodeScalar as it is encoded in this
  /// encoding.
  associatedtype EncodedScalar : EncodedScalarProtocol

  /// Produces a scalar of this encoding if possible; returns `nil` otherwise.
  static func encode<Scalar: EncodedScalarProtocol>(
    _:Scalar) -> Self.EncodedScalar?

  /// Parse a single unicode scalar forward from `input`.
  ///
  /// - Parameter knownCount: a number of code units known to exist in `input`.
  /// **Note:** passing a known compile-time constant is strongly advised,
  /// even if it's zero.
  static func parseScalarForward<C: Collection>(
    _ input: C, knownCount: Int /* = 0, via extension */
  ) -> ParseResult<EncodedScalar, C.Index>
  where C.Iterator.Element == EncodedScalar.Iterator.Element

  /// Parse a single unicode scalar in reverse from `input`.
  ///
  /// - Parameter knownCount: a number of code units known to exist in `input`.
  /// **Note:** passing a known compile-time constant is strongly advised,
  /// even if it's zero.
  static func parseScalarReverse<C: BidirectionalCollection>(
    _ input: C, knownCount: Int /* = 0 , via extension */
  ) -> ParseResult<EncodedScalar, C.Index>
  where C.Iterator.Element == EncodedScalar.Iterator.Element
}

/// Parsing multiple unicode scalar values
extension UnicodeEncoding {
  @discardableResult
  public static func parseForward<C: Collection>(
    _ input: C,
    repairingIllFormedSequences makeRepairs: Bool = true,
    into output: (EncodedScalar) throws->Void
  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)

  @discardableResult
  public static func parseReverse<C: BidirectionalCollection>(
    _ input: C,
    repairingIllFormedSequences makeRepairs: Bool = true,
    into output: (EncodedScalar) throws->Void
  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
  where C.SubSequence : BidirectionalCollection,
        C.SubSequence.SubSequence == C.SubSequence,
        C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
}

UnicodeCodec will be updated to refine UnicodeEncoding, and all existing
codecs will conform to it.
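
To make the shape of the parse result concrete, here is a toy sketch of a forward parse loop. Everything in it is an assumption for illustration: `ParseResult` is a local copy of the pitched enum, and `parseASCIIForward(_:)` and `decodeASCII(_:)` are hypothetical helpers that handle only ASCII, not a real encoding from the proposal.

```swift
// Local copy of the pitched result type (illustrative, not a stdlib type).
enum ParseResult<T, Index> {
    case valid(T, resumptionPoint: Index)
    case error(resumptionPoint: Index)
    case emptyInput
}

// Hypothetical ASCII-only parser: consumes one byte per call.
func parseASCIIForward<C: Collection>(_ input: C)
    -> ParseResult<Unicode.Scalar, C.Index> where C.Element == UInt8 {
    guard let byte = input.first else { return .emptyInput }
    let next = input.index(after: input.startIndex)
    if byte < 0x80 {
        return .valid(Unicode.Scalar(byte), resumptionPoint: next)
    }
    return .error(resumptionPoint: next)
}

// Drives the parser, repairing errors with U+FFFD and counting them.
func decodeASCII(_ bytes: [UInt8]) -> (text: String, errorCount: Int) {
    var scalars = String.UnicodeScalarView()
    var errors = 0
    var rest = bytes[...]
    loop: while true {
        switch parseASCIIForward(rest) {
        case .valid(let scalar, let next):
            scalars.append(scalar)
            rest = rest[next...]
        case .error(let next):
            scalars.append("\u{FFFD}")
            errors += 1
            rest = rest[next...]
        case .emptyInput:
            break loop
        }
    }
    return (String(scalars), errors)
}

let decoded = decodeASCII([0x68, 0x69, 0xFF])
print(decoded.text, decoded.errorCount)
```

Returning the resumption point as an index into the source collection is what lets the parsing functions be static: the caller, not the decoder, holds the position.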

Note, depending on whether this change lands before or after some of the
generics features, generic where clauses may need to be added temporarily.
Source compatibility

Adding collection conformance to String should not materially impact
source stability as it is purely additive: Swift 3’s String interface
currently fulfills all of the requirements for a bidirectional range
replaceable collection.

Altering String’s slicing operations to return a different type is source
breaking. The following mitigating steps are proposed:

   - Add a deprecated subscript operator that will run in Swift 3
   compatibility mode and which will return a String not a Substring.
   - Add deprecated versions of all current slicing methods to similarly
   return a String.

i.e.:

extension String {
  @available(swift, obsoleted: 4)
  subscript(bounds: Range<Index>) -> String {
    return String(characters[bounds])
  }

  @available(swift, obsoleted: 4)
  subscript(bounds: ClosedRange<Index>) -> String {
    return String(characters[bounds])
  }
}

In a review of 77 popular Swift projects found on GitHub, these changes
resolved any build issues in the 12 projects that assumed an explicit
String type returned from slicing operations.

Because of the change in internal implementation, these
operations will be *O(n)* rather than *O(1)*. This is not expected to be
a major concern, based on experiences from a similar change made to Java,
but projects will be able to work around performance issues without
upgrading to Swift 4 by explicitly typing slices as Substring, which will
call the Swift 4 variant, and which will be available but not invoked by
default in Swift 3 mode.
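
A sketch of what affected code looks like once slices are Substrings, assuming Swift 4 mode as shipped; `shout(_:)` is a hypothetical function standing in for any API that takes a concrete String.

```swift
// Hypothetical API that takes a concrete String.
func shout(_ text: String) -> String {
    return text.uppercased() + "!"
}

let sentence = "hello world"
let space = sentence.firstIndex(of: " ")!

// In Swift 4 mode the slice is a Substring: O(1), no copy.
let word = sentence[..<space]

// The copy happens only where a String is actually required.
let loud = shout(String(word))
print(loud)
```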

The C string interoperability methods outside the ones described in the
detailed design will remain in Swift 3 mode, be deprecated in Swift 4 mode,
and be removed in a subsequent release. UnicodeCodec will be similarly
deprecated.
Effect on ABI stability

As a fundamental currency type for Swift, it is essential that the String type
(and its associated subsequence) is in a good long-term state before being
locked down when Swift declares ABI stability. Shrinking the size of
String to be 64 bits is an important part of this.
Effect on API resilience

Decisions about the API resilience of the String type are still to be
determined, but are not adversely affected by this proposal.
Alternatives considered

For a more in-depth discussion of some of the trade-offs in string design,
see the manifesto and associated evolution thread
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170116/thread.html#30497>
.

This proposal does not yet introduce an implicit conversion from Substring
to String. The decision on whether to add this will be deferred pending
feedback on the initial implementation. The intention is to make a preview
toolchain available for feedback, including on whether this implicit
conversion is necessary, prior to the release of Swift 4.

Several of the types related to String, such as the encodings, would
ideally reside inside a namespace rather than live at the top level of the
standard library. The best namespace for this is probably Unicode, but
this is also the name of the protocol. At some point if we gain the ability
to nest enums and types inside protocols, they should be moved there.
Putting them inside String or some other enum namespace is probably not
worthwhile in the meantime.
_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Zachary Waldowski) #16

Today's String already supports this through
`String.decodeCString(_:as:repairingInvalidCodeUnits:)`, passing a
buffer pointer.

Best,

  Zachary Waldowski

  zach@waldowski.me

Links:

  1. http://stackoverflow.com/a/27456220/251153

On Thu, Mar 30, 2017, at 12:35 PM, Félix Cloutier via swift-evolution wrote:

I don't have many non-nitpick issues that I greatly care about; I'm in
favor of this.

My only request: it's currently painful to create a String from a fixed-
size C array. For instance, if I have a pointer to a `struct foo {
char name[16]; }` in Swift where the last character doesn't have to be
a NUL, it's hard to create a String from it. Real-world examples of
this are Mach-O LC_SEGMENT and LC_SEGMENT_64 commands.

The generally-accepted wisdom[1] is that you take a pointer to the
CChar tuple that represents the fixed-size array, but this still
requires the string to be NUL-terminated. What do we think of an
additional init(cString:) overload that takes an UnsafeBufferPointer
and reads up to the first NUL or the end of the buffer, whichever
comes first?


(Jean-Daniel) #17

I’m with you on a C interop API that supports taking a non-null-terminated string. I often work with APIs that support efficient parsing by providing a pointer to a global buffer plus a length to report parsed strings.

Without a way to create a Swift string from a buffer plus a length, interop with such APIs will be difficult for no good reason, as Swift strings don’t even have to be null-terminated.
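
A sketch of what such a helper could look like when built on APIs that exist today (`String(decoding:as:)`, Swift 4 or later). The name `makeString(fromFixedSize:)` is hypothetical; it is not part of the pitch.

```swift
// Reads up to the first NUL or the end of the buffer, whichever comes
// first, repairing any invalid UTF-8 with U+FFFD.
func makeString(fromFixedSize buffer: UnsafeBufferPointer<CChar>) -> String {
    let bytes = buffer.prefix(while: { $0 != 0 })
        .map { UInt8(truncatingIfNeeded: $0) }
    return String(decoding: bytes, as: UTF8.self)
}

// A 16-byte field holding "__TEXT" padded with NULs.
let padded: [CChar] = [95, 95, 84, 69, 88, 84] + [CChar](repeating: 0, count: 10)
// A 16-byte field that is completely full, with no terminator.
let full: [CChar] = "0123456789abcdef".utf8.map { CChar(truncatingIfNeeded: $0) }

let paddedName = padded.withUnsafeBufferPointer { makeString(fromFixedSize: $0) }
let fullName = full.withUnsafeBufferPointer { makeString(fromFixedSize: $0) }
print(paddedName, fullName)
```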

On 30 March 2017 at 18:35, Félix Cloutier via swift-evolution <swift-evolution@swift.org> wrote:

I don't have many non-nitpick issues that I greatly care about; I'm in favor of this.

My only request: it's currently painful to create a String from a fixed-size C array. For instance, if I have a pointer to a `struct foo { char name[16]; }` in Swift where the last character doesn't have to be a NUL, it's hard to create a String from it. Real-world examples of this are Mach-O LC_SEGMENT and LC_SEGMENT_64 commands.

The generally-accepted wisdom <http://stackoverflow.com/a/27456220/251153> is that you take a pointer to the CChar tuple that represents the fixed-size array, but this still requires the string to be NUL-terminated. What do we think of an additional init(cString:) overload that takes an UnsafeBufferPointer and reads up to the first NUL or the end of the buffer, whichever comes first?

On 30 March 2017 at 02:48, Brent Royal-Gordon via swift-evolution <swift-evolution@swift.org> wrote:

On Mar 29, 2017, at 5:32 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

Hi Swift Evolution,

Below is a pitch for the first part of the String revision. This covers a number of changes that would allow the basic internals to be overhauled.

Online version here: https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md

Really great stuff, guys. Thanks for your work on this!

In order to be able to write extensions across both String and Substring, a new Unicode protocol to which the two types will conform will be introduced. For the purposes of this proposal, Unicode will be defined as a protocol to be used whenever you would previously extend String. It should be possible to substitute extension Unicode { ... } in Swift 4 wherever extension String { ... } was written in Swift 3, with one exception: any passing of self into an API that takes a concrete String will need to be rewritten as String(self). If Self is a String then this should effectively optimize to a no-op, whereas if Self is a Substring then this will force a copy, helping to avoid the “memory leak” problems described above.

I continue to feel that `Unicode` is the wrong name for this protocol, essentially because it sounds like a protocol for, say, a version of Unicode or some kind of encoding machinery instead of a Unicode string. I won't rehash that argument since I made it already in the manifesto thread, but I would like to make a couple new suggestions in this area.

Later on, you note that it would be nice to namespace many of these types:

Several of the types related to String, such as the encodings, would ideally reside inside a namespace rather than live at the top level of the standard library. The best namespace for this is probably Unicode, but this is also the name of the protocol. At some point if we gain the ability to nest enums and types inside protocols, they should be moved there. Putting them inside String or some other enum namespace is probably not worthwhile in the mean-time.

Perhaps we should use an empty enum to create a `Unicode` namespace and then nest the protocol within it via typealias. If we do that, we can consider names like `Unicode.Collection` or even `Unicode.String` which would shadow existing types if they were top-level.
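
A rough sketch of the namespace idea, under stated assumptions: the names `TextNamespace` and `_TextCollection` are hypothetical (the real name `Unicode` is avoided here because current standard libraries already define a type with that name), and the protocol has no requirements of its own.

```swift
// The protocol itself lives at the top level under a private-looking name...
protocol _TextCollection: BidirectionalCollection {}

// ...and an empty enum exposes it under a namespaced spelling.
enum TextNamespace {
    typealias Collection = _TextCollection
}

extension String: TextNamespace.Collection {}
extension Substring: TextNamespace.Collection {}

// The namespaced name can be used as a generic constraint.
func elementCount<T: TextNamespace.Collection>(_ text: T) -> Int {
    return text.count
}

let wholeCount = elementCount("hello")
let sliceCount = elementCount("hello".prefix(2))
print(wholeCount, sliceCount)
```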

If not, then given this:

The exact nature of the protocol, such as which methods should be protocol requirements and which can be implemented as protocol extensions, is considered an implementation detail and so is not covered in this proposal.

We may simply want to wait to choose a name. As the protocol develops, we may discover a theme in its requirements which would suggest a good name. For instance, we may realize that the core of what the protocol abstracts is grouping code units into characters, which might suggest a name like `Characters`, or `Unicode.Characters`, or `CharacterCollection`, or what-have-you.

(By the way, I hope that the eventual protocol requirements will be put through the review process, if only as an amendment, once they're determined.)

Unicode will conform to BidirectionalCollection. RangeReplaceableCollection conformance will be added directly onto the String and Substring types, as it is possible future Unicode-conforming types might not be range-replaceable (e.g. an immutable type that wraps a const char *).

I'm a little worried about this because it seems to imply that the protocol cannot include any mutation operations that aren't in `RangeReplaceableCollection`. For instance, it won't be possible to include an in-place `applyTransform` method in the protocol. Do you anticipate that being an issue? Might it be a good idea to define a parallel `Mutable` or `RangeReplaceable` protocol?

The C string interop methods will be updated to those described here: a single withCString operation and two init(cString:) constructors, one for UTF8 and one for arbitrary encodings.

Sorry if I'm repeating something that was already discussed, but is there a reason you don't include a `withCString` variant for arbitrary encodings? It seems like an odd asymmetry.

The standard library currently lacks a Latin1 codec, so an enum Latin1: UnicodeEncoding type will be added.

Nice. I wrote one of those once; I'll enjoy deleting it.

A new protocol, UnicodeEncoding, will be added to replace the current UnicodeCodec protocol:

public enum UnicodeParseResult<T, Index> {

Either `T` should be given a more specific name, or the enum should be given a less specific one, becoming `ParseResult` and being oriented towards incremental parsing of anything from any kind of collection.

/// Indicates valid input was recognized.
///
/// `resumptionPoint` is the end of the parsed region
case valid(T, resumptionPoint: Index) // FIXME: should these be reordered?

No, I think this is the right order. The thing that's valid is the code point.

/// Indicates invalid input was recognized.
///
/// `resumptionPoint` is the next position at which to continue parsing after
/// the invalid input is repaired.
case error(resumptionPoint: Index)

I know this is abbreviated documentation, but I hope the full version includes a good usage example demonstrating, among other things, how to detect partial characters and defer processing of them instead of rejecting them as erroneous.

/// An encoding for text with UnicodeScalar as a common currency type
public protocol UnicodeEncoding {
/// The maximum number of code units in an encoded unicode scalar value
static var maxLengthOfEncodedScalar: Int { get }

/// A type that can represent a single UnicodeScalar as it is encoded in this
/// encoding.
associatedtype EncodedScalar : EncodedScalarProtocol

There's an `EncodedScalarProtocol`-shaped hole in this proposal. What does it do? What are its semantics? How does `EncodedScalar` relate to the old `CodeUnit`?

@discardableResult
public static func parseForward<C: Collection>(
   _ input: C,
   repairingIllFormedSequences makeRepairs: Bool = true,
   into output: (EncodedScalar) throws->Void
) rethrows -> (remainder: C.SubSequence, errorCount: Int)

@discardableResult
public static func parseReverse<C: BidirectionalCollection>(
   _ input: C,
   repairingIllFormedSequences makeRepairs: Bool = true,
   into output: (EncodedScalar) throws->Void
) rethrows -> (remainder: C.SubSequence, errorCount: Int)
where C.SubSequence : BidirectionalCollection,
       C.SubSequence.SubSequence == C.SubSequence,
       C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
}

Are there constraints missing on `parseForward`?

What do these do if `makeRepairs` is false? Would it be clearer if we made an enum that described the behaviors and changed the label to something like `ifIllFormed:`?
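
One way the enum-based alternative might read, as a sketch only: `IllFormedPolicy` and `decodeUTF8(_:ifIllFormed:)` are hypothetical names, and the decoding loop uses the existing `UTF8` codec rather than the pitched `parseForward`.

```swift
// Hypothetical policy enum replacing the Bool `makeRepairs` flag.
enum IllFormedPolicy {
    case repair  // substitute U+FFFD and continue
    case fail    // stop and report failure
}

func decodeUTF8(_ bytes: [UInt8], ifIllFormed policy: IllFormedPolicy) -> String? {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    var scalars = String.UnicodeScalarView()
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            scalars.append(scalar)
        case .emptyInput:
            return String(scalars)
        case .error:
            if policy == .fail { return nil }
            scalars.append("\u{FFFD}")
        }
    }
}

let lenient = decodeUTF8([0x68, 0xFF, 0x69], ifIllFormed: .repair)
let rejected = decodeUTF8([0x68, 0xFF, 0x69], ifIllFormed: .fail)
print(lenient ?? "nil", rejected ?? "nil")
```

With an enum, the call site states the behavior on ill-formed input directly, and further policies could be added later without changing the signature.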

Due to the change in internal implementation, this means that these operations will be O(n) rather than O(1). This is not expected to be a major concern, based on experiences from a similar change made to Java, but projects will be able to work around performance issues without upgrading to Swift 4 by explicitly typing slices as Substring, which will call the Swift 4 variant, and which will be available but not invoked by default in Swift 3 mode.

Will there be a way to make this also work with a real Swift 3 compiler? For instance, can you define `typealias Substring = String` in such a way that real Swift 3 will parse and use it, but Swift 4 in Swift 3 mode will ignore it?

This proposal does not yet introduce an implicit conversion from Substring to String. The decision on whether to add this will be deferred pending feedback on the initial implementation. The intention is to make a preview toolchain available for feedback, including on whether this implicit conversion is necessary, prior to the release of Swift 4.

This is a sensible approach.

Thank you for developing this into a full proposal. I discussed the plans for Swift 4 with a local group of programmers recently, and everyone was pleased to hear that `String` would get an overhaul, that the `characters` view would be integrated into the string, etc. We even talked a little about `Substring` and people thought it was a good idea. This proposal is shaping up to impact a lot of people, but in a good way!

--
Brent Royal-Gordon
Architechies


(Ben Cohen) #18

We-eelll, there is “Stringy”….

As tempting as it is to call the protocol this, it’s probably not a good idea.

(then again, if we called it Text instead of String, we could then call the subsequence Subtext…)

On Mar 30, 2017, at 11:20 AM, Brent Royal-Gordon <brent@architechies.com> wrote:

(That's why there's no adjective form of "string", which makes naming the protocol difficult.)


(Adrian Zubarev) #19

If I had to choose, as a non-native English speaker, I’d go for SubString just for the camel case consistency across all other types.

We cannot rename SubSequence to Subsequence, because that would be odd compared to all other types containing Sequence.

AnySequence
LazyPrefixWhileSequence
LazySequence
EnumeratedSequence
etc.

This won’t break anything or create any other inconsistency.

--
Adrian Zubarev
Sent with Airmail

On 30 March 2017 at 17:51:09, Joshua Alvarado via swift-evolution (swift-evolution@swift.org) wrote:

...my vote would be to lowercase Subsequence. We can typealias SubSequence = Subsequence to aid migration

+1, didn't think that was an option. A good solution would be to have them either camel case (SubString, SubSequence) or just capitalized (Substring, Subsequence); either would be nice as long as they matched.

On Thu, Mar 30, 2017 at 9:38 AM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

On Mar 29, 2017, at 6:59 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

This looks great. The restored conformances to *Collection will be huge.

Is this to be the first of several or the only major part of the manifesto to be delivered in Swift 4?

First of several. This lays the ground work for the changes to the underlying implementation. Other changes will mostly be additive on top.

Nits on naming: are we calling it Substring or SubString (à la SubSequence)?

This is venturing into subjective territory, so these are just my feelings rather than something definitive (Dave may differ) but:

It should definitely be Substring. My rule of thumb: if you might hyphenate it, you can capitalize it. I don’t think anyone spells it "sub-string". OTOH one might write "sub-sequence". Generally hyphens disappear in English as things come into common usage, e.g. it used to be e-mail but now it’s mostly just email. Substring is enough of a term of art in programming that this has happened. Admittedly, Subsequence is a term of art too – unfortunately one that has a different meaning to ours ("a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements" e.g. <A,C,E> is a Subsequence of <A,B,C,D,E> – see https://en.wikipedia.org/wiki/Subsequence). Even worse, the mathematical term for what we are calling a subsequence is a Substring!

If we were to change anything, my vote would be to lowercase Subsequence. We can typealias SubSequence = Subsequence to aid migration, with a slow burn on deprecating it since it’ll be quite a footling deprecation. I don’t know if it’s worth it though – the main use of “SubSequence” is currently in those pesky where clauses you have to put on all your Collection extensions if you want to use slicing, and many of these will be eliminated once we have the ability to put where clauses on associated types.

and shouldn't it be UnicodeParsedResult rather than UnicodeParseResult?

I think Parse. As in, this is the result of a parse, not these are the parsed results (though it does contain parsed results in some cases, but not all).

On Wed, Mar 29, 2017 at 19:32 Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:
Hi Swift Evolution,

Below is a pitch for the first part of the String revision. This covers a number of changes that would allow the basic internals to be overhauled.

Online version here: https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md

String Revision: Collection Conformance, C Interop, Transcoding

Proposal: SE-0161
Authors: Ben Cohen, Dave Abrahams
Review Manager: TBD
Status: Awaiting review
Introduction

This proposal is to implement a subset of the changes from the Swift 4 String Manifesto.

Specifically:

Make String conform to BidirectionalCollection
Make String conform to RangeReplaceableCollection
Create a Substring type for String.SubSequence
Create a Unicode protocol to allow for generic operations over both types.
Consolidate on a concise set of C interop methods.
Revise the transcoding infrastructure.
Other existing aspects of String remain unchanged for the purposes of this proposal.

Motivation

This proposal follows up on a number of recommendations found in the manifesto:

Collection conformance was dropped from String in Swift 2. After reevaluation, the feeling is that the minor semantic discrepancies (mainly with RangeReplaceableCollection) are outweighed by the significant benefits of restoring these conformances. For more detail on the reasoning, see here

While it is not a collection, the Swift 3 string does have slicing operations. String is currently serving as its own subsequence, allowing substrings to share storage with their “owner”. This can lead to memory leaks when small substrings of larger strings are stored long-term (see here for more detail on this problem). Introducing a separate type of Substring to serve as String.Subsequence is recommended to resolve this issue, in a similar fashion to ArraySlice.

As noted in the manifesto, support for interoperation with nul-terminated C strings in Swift 3 is scattered and incoherent, with six ways to transform a C string into a String and four ways to do the inverse. These APIs should be replaced with a simpler set of methods on String.

Proposed solution

A new type, Substring, will be introduced. Similar to ArraySlice it will be documented as only for short- to medium-term storage:

Important

Long-term storage of Substring instances is discouraged. A substring holds a reference to the entire storage of a larger string, not just to the portion it presents, even after the original string’s lifetime ends. Long-term storage of a substring may therefore prolong the lifetime of elements that are no longer otherwise accessible, which can appear to be memory leakage.

Aside from minor differences, such as having a SubSequence of Self and a larger size to describe the range of the subsequence, Substring will be near-identical from a user perspective.

In order to be able to write extensions across both String and Substring, a new Unicode protocol to which the two types will conform will be introduced. For the purposes of this proposal, Unicode will be defined as a protocol to be used whenever you would previously extend String. It should be possible to substitute extension Unicode { ... } in Swift 4 wherever extension String { ... } was written in Swift 3, with one exception: any passing of self into an API that takes a concrete String will need to be rewritten as String(self). If Self is a String then this should effectively optimize to a no-op, whereas if Self is a Substring then this will force a copy, helping to avoid the “memory leak” problems described above.

The exact nature of the protocol, such as which methods should be protocol requirements and which can be implemented as protocol extensions, is considered an implementation detail and so is not covered in this proposal.

Unicode will conform to BidirectionalCollection. RangeReplaceableCollection conformance will be added directly onto the String and Substring types, as it is possible future Unicode-conforming types might not be range-replaceable (e.g. an immutable type that wraps a const char *).

The C string interop methods will be updated to those described here: a single withCString operation and two init(cString:) constructors, one for UTF8 and one for arbitrary encodings. The primary change is to remove “non-repairing” variants of construction from nul-terminated C strings. In both of the construction APIs, any invalid encoding sequence detected will have its longest valid prefix replaced by U+FFFD, the Unicode replacement character, per the Unicode specification. This covers the common case. The replacement is done physically in the underlying storage and the validity of the result is recorded in the String’s encoding such that future accesses need not be slowed down by possible error repair separately. Construction that is aborted when encoding errors are detected can be accomplished using APIs on the encoding.

The current transcoding support will be updated to improve usability and performance. The primary changes will be:

   - to allow transcoding directly from one encoding to another without having to triangulate through an intermediate scalar value
   - to add the ability to transcode an input collection in reverse, allowing the different views on String to be made bi-directional
   - to have decoding take a collection rather than an iterator, and return an index of its progress into the source, allowing that method to be static

The standard library currently lacks a Latin1 codec, so an enum Latin1: UnicodeEncoding type will be added.

Detailed design

The following additions will be made to the standard library:

protocol Unicode: BidirectionalCollection {
  // Implementation detail as described above
}

extension String: Unicode, RangeReplaceableCollection {
  typealias SubSequence = Substring
}

struct Substring: Unicode, RangeReplaceableCollection {
  typealias SubSequence = Substring
  // near-identical API surface area to String
}

The subscript operations on String will be amended to return Substring:

struct String {
  subscript(bounds: Range<String.Index>) -> Substring { get }
  subscript(bounds: ClosedRange<String.Index>) -> Substring { get }
}

Note that properties or methods that due to their nature create new String storage (such as lowercased()) will not change.

C string interop will be consolidated on the following methods:

extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  /// bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)
   
  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  /// the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  /// should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)
     
  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}
Additionally, the current ability to pass a Swift String into C methods that take a C string will remain as-is.
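The two ways of handing a String to C-style code can be sketched as follows. `byteLength` is a hypothetical stand-in for a C function taking `const char *`; it is written in Swift here only so the example needs no imports.

```swift
// Hypothetical stand-in for a C function that takes `const char *`.
// Counts the bytes before the NUL terminator.
func byteLength(_ cString: UnsafePointer<CChar>) -> Int {
    var count = 0
    while cString[count] != 0 { count += 1 }
    return count
}

let text = "h\u{E9}llo"   // "é" occupies two bytes in UTF-8

// Explicit form: the pointer is valid only for the duration of the closure.
let viaClosure = text.withCString { byteLength($0) }

// Implicit form: the compiler passes a temporary NUL-terminated UTF-8 buffer.
let viaBridging = byteLength(text)

print(viaClosure, viaBridging)
```

In both forms the pointer must not escape the call, since the underlying buffer is temporary.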

A new protocol, UnicodeEncoding, will be added to replace the current UnicodeCodec protocol:

public enum UnicodeParseResult<T, Index> {
  /// Indicates valid input was recognized.
  ///
  /// `resumptionPoint` is the end of the parsed region
  case valid(T, resumptionPoint: Index) // FIXME: should these be reordered?

  /// Indicates invalid input was recognized.
  ///
  /// `resumptionPoint` is the next position at which to continue parsing after
  /// the invalid input is repaired.
  case error(resumptionPoint: Index)

  /// Indicates that there was no more input to consume.
  case emptyInput

  /// If any input was consumed, the point from which to continue parsing.
  var resumptionPoint: Index? {
    switch self {
    case .valid(_,let r): return r
    case .error(let r): return r
    case .emptyInput: return nil
    }
  }
}

/// An encoding for text with UnicodeScalar as a common currency type
public protocol UnicodeEncoding {
  /// The maximum number of code units in an encoded unicode scalar value
  static var maxLengthOfEncodedScalar: Int { get }
   
  /// A type that can represent a single UnicodeScalar as it is encoded in this
  /// encoding.
  associatedtype EncodedScalar : EncodedScalarProtocol

  /// Produces a scalar of this encoding if possible; returns `nil` otherwise.
  static func encode<Scalar: EncodedScalarProtocol>(
    _:Scalar) -> Self.EncodedScalar?
   
  /// Parse a single unicode scalar forward from `input`.
  ///
  /// - Parameter knownCount: a number of code units known to exist in `input`.
  /// **Note:** passing a known compile-time constant is strongly advised,
  /// even if it's zero.
  static func parseScalarForward<C: Collection>(
    _ input: C, knownCount: Int /* = 0, via extension */
  ) -> UnicodeParseResult<EncodedScalar, C.Index>
  where C.Iterator.Element == EncodedScalar.Iterator.Element

  /// Parse a single unicode scalar in reverse from `input`.
  ///
  /// - Parameter knownCount: a number of code units known to exist in `input`.
  /// **Note:** passing a known compile-time constant is strongly advised,
  /// even if it's zero.
  static func parseScalarReverse<C: BidirectionalCollection>(
    _ input: C, knownCount: Int /* = 0 , via extension */
  ) -> UnicodeParseResult<EncodedScalar, C.Index>
  where C.Iterator.Element == EncodedScalar.Iterator.Element
}

/// Parsing multiple unicode scalar values
extension UnicodeEncoding {
  @discardableResult
  public static func parseForward<C: Collection>(
    _ input: C,
    repairingIllFormedSequences makeRepairs: Bool = true,
    into output: (EncodedScalar) throws->Void
  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
   
  @discardableResult
  public static func parseReverse<C: BidirectionalCollection>(
    _ input: C,
    repairingIllFormedSequences makeRepairs: Bool = true,
    into output: (EncodedScalar) throws->Void
  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
  where C.SubSequence : BidirectionalCollection,
        C.SubSequence.SubSequence == C.SubSequence,
        C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
}
UnicodeCodec will be updated to refine UnicodeEncoding, and all existing codecs will conform to it.
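To make the "parse one scalar, get back a resumption index" shape more concrete, here is a small self-contained sketch. Everything in it is hypothetical: `SketchParseResult` is a local copy of the result enum under a different name, and `parseASCIIForward` is a toy ASCII-only parser rather than any real encoding from the proposal. The point is only to show how a static, collection-based parse step can drive a repairing loop.

```swift
// Local stand-in for the proposed result type (name is hypothetical).
enum SketchParseResult<T, Index> {
    case valid(T, resumptionPoint: Index)
    case error(resumptionPoint: Index)
    case emptyInput
}

// Toy ASCII-only parser: consumes exactly one code unit per call and
// reports where the caller should resume.
func parseASCIIForward<C: Collection>(
    _ input: C
) -> SketchParseResult<Unicode.Scalar, C.Index> where C.Element == UInt8 {
    guard let first = input.first else { return .emptyInput }
    let next = input.index(after: input.startIndex)
    if first < 0x80 {
        return .valid(Unicode.Scalar(first), resumptionPoint: next)
    }
    return .error(resumptionPoint: next)
}

// Drives the parser over a whole buffer, repairing errors with U+FFFD.
func decodeASCII(_ bytes: [UInt8]) -> (text: String, errorCount: Int) {
    let replacement: Unicode.Scalar = "\u{FFFD}"
    var scalars = String.UnicodeScalarView()
    var errorCount = 0
    var rest = bytes[...]
    while true {
        switch parseASCIIForward(rest) {
        case .valid(let scalar, let resume):
            scalars.append(scalar)
            rest = rest[resume...]
        case .error(let resume):
            scalars.append(replacement)
            errorCount += 1
            rest = rest[resume...]
        case .emptyInput:
            return (String(scalars), errorCount)
        }
    }
}

let decoded = decodeASCII([0x48, 0x69, 0xFF, 0x21])
print(decoded.text == "Hi\u{FFFD}!", decoded.errorCount)
```

Because the parse step holds no state between calls, it can be a static function, which is the property the proposal is after.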

Note, depending on whether this change lands before or after some of the generics features, generic where clauses may need to be added temporarily.

Source compatibility

Adding collection conformance to String should not materially impact source stability as it is purely additive: Swift 3’s String interface currently fulfills all of the requirements for a bidirectional range replaceable collection.

Altering String’s slicing operations to return a different type is source breaking. The following mitigating steps are proposed:

Add a deprecated subscript operator that will run in Swift 3 compatibility mode and which will return a String not a Substring.

Add deprecated versions of all current slicing methods to similarly return a String.

i.e.:

extension String {
  @available(swift, obsoleted: 4)
  subscript(bounds: Range<Index>) -> String {
    return String(characters[bounds])
  }

  @available(swift, obsoleted: 4)
  subscript(bounds: ClosedRange<Index>) -> String {
    return String(characters[bounds])
  }
}
In a review of 77 popular Swift projects found on GitHub, these changes resolved any build issues in the 12 projects that assumed an explicit String type returned from slicing operations.

Because of the change in internal implementation, these compatibility operations will be O(n) rather than O(1). This is not expected to be a major concern, based on experience from a similar change made to Java. Projects that do hit performance issues will be able to work around them without upgrading to Swift 4 by explicitly typing slices as Substring. Doing so calls the Swift 4 variant, which will be available in Swift 3 mode but not invoked by default.
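The workaround might look roughly like this. The sketch is written against Swift as it later shipped, where the protocol pitched here as `Unicode` is named `StringProtocol` (an assumption relative to this pitch); `countVowels` is a hypothetical helper.

```swift
// Hypothetical helper: generic over String and Substring, so callers
// never need to copy a slice just to pass it along.
func countVowels<S: StringProtocol>(_ text: S) -> Int {
    return text.filter { "aeiou".contains($0) }.count
}

let line = "key=value"
let separator = line.firstIndex(of: "=") ?? line.endIndex

// Explicitly typed as Substring: shares storage with `line`, no copy.
let key: Substring = line[..<separator]

// Converting to String copies the characters, which is O(n).
let keyCopy = String(line[..<separator])

print(countVowels(key), countVowels(line), keyCopy)
```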

The C string interoperability methods outside the ones described in the detailed design will remain in Swift 3 mode, be deprecated in Swift 4 mode, and be removed in a subsequent release. UnicodeCodec will be similarly deprecated.

Effect on ABI stability

As a fundamental currency type for Swift, it is essential that the String type (and its associated subsequence) is in a good long-term state before being locked down when Swift declares ABI stability. Shrinking the size of String to be 64 bits is an important part of this.

Effect on API resilience

Decisions about the API resilience of the String type are still to be determined, but are not adversely affected by this proposal.

Alternatives considered

For a more in-depth discussion of some of the trade-offs in string design, see the manifesto and associated evolution thread.

This proposal does not yet introduce an implicit conversion from Substring to String. The decision on whether to add this will be deferred pending feedback on the initial implementation. The intention is to make a preview toolchain available for feedback, including on whether this implicit conversion is necessary, prior to the release of Swift 4.

Several of the types related to String, such as the encodings, would ideally reside inside a namespace rather than live at the top level of the standard library. The best namespace for this is probably Unicode, but this is also the name of the protocol. If at some point we gain the ability to nest enums and types inside protocols, they should be moved there. Putting them inside String or some other enum namespace is probably not worthwhile in the meantime.
_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

--
Joshua Alvarado
alvaradojoshua0@gmail.com


(Félix Cloutier) #20

Does it? According to the documentation for the current decodeCString <https://developer.apple.com/reference/swift/string/1641442-decodecstring>, it seems to accept an UnsafePointer, not a buffer pointer, and expects the string to be null-terminated. Am I missing another overload?

On 30 March 2017 at 17:27, Zach Waldowski via swift-evolution <swift-evolution@swift.org> wrote:

On Thu, Mar 30, 2017, at 12:35 PM, Félix Cloutier via swift-evolution wrote:

I don't have many non-nitpick issues that I greatly care about; I'm in favor of this.

My only request: it's currently painful to create a String from a fixed-size C array. For instance, if I have a pointer to a `struct foo { char name[16]; }` in Swift where the last character doesn't have to be a NUL, it's hard to create a String from it. Real-world examples of this are Mach-O LC_SEGMENT and LC_SEGMENT_64 commands.

The generally-accepted wisdom <http://stackoverflow.com/a/27456220/251153> is that you take a pointer to the CChar tuple that represents the fixed-size array, but this still requires the string to be NUL-terminated. What do we think of an additional init(cString:) overload that takes an UnsafeBufferPointer and reads up to the first NUL or the end of the buffer, whichever comes first?
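For concreteness, the requested behavior could be sketched like this. `stringFromFixedSizeField` is a hypothetical helper, not an existing or proposed initializer, and it assumes the field holds UTF-8; it reads up to the first NUL or the end of the buffer, whichever comes first.

```swift
// Hypothetical helper illustrating the requested behavior: stop at the
// first NUL byte, or at the end of the buffer if there is none.
func stringFromFixedSizeField(_ field: UnsafeBufferPointer<UInt8>) -> String {
    let end = field.firstIndex(of: 0) ?? field.endIndex
    // String(decoding:as:) repairs invalid UTF-8 with U+FFFD.
    return String(decoding: field[..<end], as: UTF8.self)
}

// A 16-byte field padded with NULs, like a short segment name.
let paddedName: [UInt8] = Array("__TEXT".utf8) + [UInt8](repeating: 0, count: 10)
let short = paddedName.withUnsafeBufferPointer { stringFromFixedSizeField($0) }

// A 16-byte field that is completely full and has no terminator.
let fullName: [UInt8] = Array("0123456789abcdef".utf8)
let full = fullName.withUnsafeBufferPointer { stringFromFixedSizeField($0) }

print(short, full)
```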

Today's String already supports this through `String.decodeCString(_:as:repairingInvalidCodeUnits:)`, passing a buffer pointer.

Best,
  Zachary Waldowski
  zach@waldowski.me <mailto:zach@waldowski.me>
