Faster/lower-level external String initialization

Going back and forth from Strings to their byte representations is an
important part of solving many problems, including object
serialization, binary file formats, wire/network interfaces, and
cryptography.

In developing such a parser, a coworker did the yeoman's work of
benchmarking
Swift's Unicode types. He swore up and down that
String.Type.fromCString(_:) [0]
was the fastest way he found. I, stubborn and noobish as I am, was
skeptical
that a better way couldn't be wrought from Swift's UnicodeCodecTypes.

After reading through stdlib source and doing my own testing, this is no
wives'
tale. fromCString [1] is essentially the only public user of
String.Type._fromCodeUnitSequence(_:input:), which serves the exact role
of
both efficient and safe initialization-by-buffer-copy.

Of course, fromCString isn't a silver bullet; it has to have a null
sentinel,
requiring a copy of the origin buffer if one needs to be added (as is
the
case with formats that specify the length up front, or unstructured
payloads
that use unescaped double quotes as the terminator). It also prevents
the string
itself from containing the null character.
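To make the sentinel problem concrete, here is a minimal sketch. It uses the `String(cString:)` spelling found in current Swift toolchains as a stand-in for `fromCString` (an assumption on my part; the thread predates that renaming), and the packet layout is made up for illustration.

```swift
// A hypothetical length-prefixed payload: the first byte is the length,
// and there is no NUL terminator after the text.
let packet: [UInt8] = [5, 72, 101, 108, 108, 111, 33, 33]
let length = Int(packet[0])
let payload = packet[1 ..< 1 + length]   // ArraySlice<UInt8> holding "Hello"

// The C-string route needs a sentinel, so the slice is copied and a 0 appended.
var terminated = Array(payload)
terminated.append(0)
let viaCString = terminated.withUnsafeBufferPointer { buffer in
    String(cString: buffer.baseAddress!)
}

// An embedded NUL ends the C string early, so the bytes after it are dropped.
let withEmbeddedNull: [UInt8] = [65, 0, 66, 0]
let truncated = withEmbeddedNull.withUnsafeBufferPointer { buffer in
    String(cString: buffer.baseAddress!)
}

print(viaCString, truncated.utf8.count)
```

The extra `Array(payload)` copy is the cost being discussed: the length is already known, but the API cannot take advantage of it.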

I'd like to see _fromCodeUnitSequence [2] become public API as (just
spitballing here) String.init?<Collection, Codec>(codeUnits:encoding:).
If that
can't happen, an alternative to fromCString that doesn't use strlen
would be
nice, and we can just eat the performance hit on other code unit
sequences.

I can't really think of a reason why it's not exposed yet, so I'm led to
believe
I'm just missing something major, and not that a reason doesn't exist.
;-)

There's also discussion to be had about whether new API is needed at all. Try as I might, I
can't seem to get the reserveCapacity/append(UnicodeScalar) workflow to
have
anything close to the same speed. [3] Profiling indicates that I keep
hitting
_StringBuffer.grow. I don't know if that means the buffer isn't uniquely
referenced, or it's a bug, or what, but it's consistently slower than
creating
an Array of the bytes and performing fromCString on it. Similar story
with
crossing the NSString bridge, which is even stranger. [4]
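For reference, the reserveCapacity/append workflow looks roughly like the sketch below. It is written against current Swift spellings (`unicodeScalars.append` and `Unicode.Scalar`), which are assumptions relative to the toolchain discussed here, and it makes no claim about speed; it only shows the shape of the code being compared against.

```swift
// ASCII-only input for this sketch.
let bytes: [UInt8] = Array("hello, world".utf8)

var result = String()
result.reserveCapacity(bytes.count)
for byte in bytes {
    // Unicode.Scalar(UInt8) maps 0...255 to U+0000...U+00FF, so this is only
    // a correct UTF-8 decode when every byte is ASCII.
    result.unicodeScalars.append(Unicode.Scalar(byte))
}

print(result)
```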

Anyway, I wanted to stir up discussion, see if I'm way off base and/or
whether
this can be turned into a proposal.

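[0]:
String initialization notes · GitHub
[1]:
https://github.com/apple/swift/blob/master/stdlib/public/core/CString.swift#L18-L31
[2]:
https://github.com/apple/swift/blob/master/stdlib/public/core/String.swift#L134-L150
[3]:
String initialization notes · GitHub
[4]:
String initialization notes · GitHub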

Cheers,
Zachary Waldowski
zach@waldowski.me

I'd like to see _fromCodeUnitSequence [2] become public API

I am very much in favor of this. I have had *exactly* the same experience.

String.reserveCapacity() seems to act like a no-op for some reason, so append() is incredibly slow, and fromCString() often necessitates a copy to an intermediate buffer because of the null-byte requirement.

This has been one of the weakest areas of Swift performance for me.

-CK

···

On Jan 8, 2016, at 12:21 PM, Zach Waldowski via swift-evolution <swift-evolution@swift.org> wrote:


Complete agreement from me. I would like to see a String constructor from a Sequence of code units.
Also, why is String.fromCString() a factory function rather than a failable initializer?

Guillaume Lessard

Given the initial positive response, I've taken a crack both at
implementation and converting the request to a proposal. The proposal
draft is located at:

    https://github.com/zwaldowski/swift-evolution/blob/string-from-code-units/proposals/0000-string-from-code-units.md

The code is located at:

    https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units

The proposal is reproduced below:

# Expose code unit initializers on String

* Proposal:
[SE-NNNN](https://github.com/apple/swift-evolution/blob/master/proposals/NNNN-string-from-code-units.md)
* Author: [Zachary Waldowski](https://github.com/zwaldowski)
* Status: **Awaiting review**
* Review manager: TBD

## Introduction

Going back and forth from Strings to their byte representations is an
important part of solving many problems, including object
serialization, binary file formats, wire/network interfaces, and
cryptography. Swift has such utilities, currently only exposed through
`String.Type.fromCString(_:)`.

See swift-evolution
[thread](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160104/005951.html).

## Motivation

In developing a parser, a coworker did the yeoman's work of benchmarking
Swift's Unicode types. He swore up and down that
`String.Type.fromCString(_:)`
([use](String initialization notes · GitHub))
was the fastest way he found. I, stubborn and noobish as I am, was
skeptical that a better way couldn't be wrought from Swift's
`UnicodeCodecType`s.

After reading through stdlib source and doing my own testing, this is no
wives'
tale. `fromCString` is essentially the only public-facing user of
`String.Type._fromCodeUnitSequence(_:input:)`, which serves the exact
role of
both efficient and safe initialization-by-buffer-copy.

Of course, `fromCString` isn't a silver bullet; it has to have a null
sentinel,
requiring a copy of the origin buffer if one needs to be added (as is
the
case with formats that specify the length up front, or unstructured
payloads
that use unescaped double quotes as the terminator). It also prevents
the string itself from containing the null character.

## Proposed solution

I'd like to expose `String.Type._fromCodeUnitSequence(_:input:)` as
public API:

init?<Input: CollectionType, Encoding: UnicodeCodecType where
Encoding.CodeUnit == Input.Generator.Element>(codeUnits input: Input,
encoding: Encoding.Type)

And, for consistency with
`String.Type.fromCStringRepairingIllFormedUTF8(_:)`,
exposing `String.Type._fromCodeUnitSequenceWithRepair(_:input:)`:

```swift
static func fromCodeUnitsWithRepair<Input: CollectionType, Encoding:
UnicodeCodecType where Encoding.CodeUnit ==
Input.Generator.Element>(input: Input, encoding: Encoding.Type)
```

## Detailed design

See [full
implementation](https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units).

This is a fairly straightforward renaming of the internal APIs.

The initializer, its labels, and their order were chosen to match other
non-cast
initializers in the stdlib. "Sequence" was removed, as it was a
misnomer.
"input" was kept as a generic name in order to allow for future
refinements.

The static initializer made the same changes, but was otherwise kept as
a
factory function due to its multiple return values.

`String.Type._fromWellFormedCodeUnitSequence(_:input:)` was kept as-is
for
internal use. I assume it wouldn't be good to expose publicly because,
for
lack of a better phrase, we only "trust" the stdlib to accurately know
the
wellformedness of their code units. Since it is a simple call through,
its
use could be elided throughout the stdlib.
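To give a feel for how the two entry points would read at a call site, here is a rough stand-in written for current Swift toolchains. It is not the linked implementation: it is a sketch built on the standard library's `transcode(_:from:to:stoppingOnError:into:)`, it substitutes `Collection` and `Unicode.Encoding` for the `CollectionType` and `UnicodeCodecType` names used above, and the `(String, hadError: Bool)` return shape is an assumption.

```swift
extension String {
    /// Hypothetical stand-in for the proposed failable initializer:
    /// returns nil as soon as an ill-formed sequence is found.
    init?<Input: Collection, Encoding: Unicode.Encoding>(
        codeUnits input: Input,
        encoding: Encoding.Type
    ) where Input.Element == Encoding.CodeUnit {
        var utf8: [UInt8] = []
        utf8.reserveCapacity(input.count)
        let hadError = transcode(
            input.makeIterator(), from: Encoding.self, to: UTF8.self,
            stoppingOnError: true, into: { utf8.append($0) })
        if hadError { return nil }
        self = String(decoding: utf8, as: UTF8.self)
    }

    /// Hypothetical stand-in for the repairing factory: ill-formed input
    /// becomes U+FFFD and the flag reports whether that happened.
    static func fromCodeUnitsWithRepair<Input: Collection, Encoding: Unicode.Encoding>(
        input: Input,
        encoding: Encoding.Type
    ) -> (String, hadError: Bool) where Input.Element == Encoding.CodeUnit {
        var utf8: [UInt8] = []
        let hadError = transcode(
            input.makeIterator(), from: Encoding.self, to: UTF8.self,
            stoppingOnError: false, into: { utf8.append($0) })
        return (String(decoding: utf8, as: UTF8.self), hadError)
    }
}

let wellFormed = String(codeUnits: [0x48, 0x69] as [UInt8], encoding: UTF8.self)
let illFormed = String(codeUnits: [0xFF] as [UInt8], encoding: UTF8.self)
let fromUTF16 = String(codeUnits: [0x48, 0x69] as [UInt16], encoding: UTF16.self)
let repaired = String.fromCodeUnitsWithRepair(input: [0x48, 0xFF] as [UInt8], encoding: UTF8.self)
```

The tuple return is what keeps the repairing variant a factory function: an initializer can only produce the `String` itself, not the additional error flag.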

## Impact on existing code

This is an additive change to the API.

## Alternatives considered

* A protocol-oriented API.

Some kind of `func decode<Encoding>(_:)` on `SequenceType`. It's not
really
clear this method would be related to string processing, and would
require
some kind of bounding (like `where Generator.Element:
UnsignedIntegerType`), but
that would be introducing a type bound that doesn't exist on

* Do nothing.

This seems suboptimal. The lack of this initializer on `String` limits
the performance of many kinds of pure-Swift implementations.

* Make the `NSString` [bridge
faster](String initialization notes · GitHub).

After reading the bridge code, I don't really know why it's slower.
Maybe it's
a bug.

* Make `String.append(_:)`
[faster](String initialization notes · GitHub).

I don't completely understand the growth strategy of `_StringCore`, but
it doesn't seem to exhibit the documented amortized `O(1)`, even when
`reserveCapacity(_:)` is used. In the pre-proposal discussion, a user
noted that
it seems like `reserveCapacity` acts like a no-op.

···

----

Cheers,
Zachary Waldowski
zach@waldowski.me

On Fri, Jan 8, 2016, at 03:21 PM, Zach Waldowski wrote:


I support this change as well.

TJ

···

On Fri, Jan 8, 2016 at 6:06 PM, Guillaume Lessard via swift-evolution < swift-evolution@swift.org> wrote:


Zach,

Thanks very much for writing up this proposal! This will be a very valuable addition to the standard library for some of us. My comments are below:

Given the initial positive response, I've taken a crack both at
implementation and converting the request to a proposal. The proposal
draft is located at:

   https://github.com/zwaldowski/swift-evolution/blob/string-from-code-units/proposals/0000-string-from-code-units.md

The code is located at:

   https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units

The proposal is reproduced below:

# Expose code unit initializers on String

* Proposal:
[SE-NNNN](https://github.com/apple/swift-evolution/blob/master/proposals/NNNN-string-from-code-units.md)
* Author: [Zachary Waldowski](https://github.com/zwaldowski)
* Status: **Awaiting review**
* Review manager: TBD

## Introduction

Going back and forth from Strings to their byte representations is an
important part of solving many problems, including object
serialization, binary file formats,

binary *and* text file formats!

wire/network interfaces, and
cryptography. Swift has such utilities, currently only exposed through
`String.Type.fromCString(_:)`.

See swift-evolution
[thread](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160104/005951.html).

## Motivation

In developing a parser, a coworker did the yeoman's work of benchmarking
Swift's Unicode types. He swore up and down that
`String.Type.fromCString(_:)`
([use](String initialization notes · GitHub))
was the fastest way he found. I, stubborn and noobish as I am, was
skeptical that a better way couldn't be wrought from Swift's
`UnicodeCodecType`s.

After reading through stdlib source and doing my own testing, this is no
wives'
tale. `fromCString` is essentially the only public-facing user of
`String.Type._fromCodeUnitSequence(_:input:)`, which serves the exact
role of
both efficient and safe initialization-by-buffer-copy.

It might be worth mentioning here in the Motivation section that String.append(_: UnicodeScalar) is not a viable alternative in many cases because it has much slower performance. (I know it is discussed below under alternatives.)

Of course, `fromCString` isn't a silver bullet; it has to have a null
sentinel,
requiring a copy of the origin buffer if one needs to be added (as is
the
case with formats that specify the length up front, or unstructured
payloads
that use unescaped double quotes as the terminator). It also prevents
the string itself from containing the null character.

This also means that something as fundamental as parsing sub-strings out of an NSData object requires copying to intermediate buffers or the use of much slower character-by-character appends.

Another limitation is that `fromCString` only works with UTF8 (or ASCII) encoding.

It is worth mentioning also that the implementation of fromCString() involves a string length calculation (call to strlen()). In many cases that length has already been calculated in the client code. The proposed solution has the potential of being at least slightly faster because the strlen call is not needed. Maybe this should go in the Proposed Solution section.
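As a sketch of the "length is already known" case, the snippet below pulls a substring out of the middle of a byte buffer without adding a terminator or copying into an intermediate array. It uses `String(decoding:as:)`, which as far as I know arrived in later Swift releases and is not the API proposed in this thread; note that it repairs ill-formed input rather than failing. The request line is just example data.

```swift
// Example input: one line of a text protocol, held as raw bytes.
let line: [UInt8] = Array("GET /index.html HTTP/1.1".utf8)

// The client code locates the field boundaries itself...
let firstSpace = line.firstIndex(of: 0x20)!
let pathStart = firstSpace + 1
let pathEnd = line[pathStart...].firstIndex(of: 0x20)!

// ...so the decoder can be handed exactly that slice. No strlen, no sentinel.
let path = String(decoding: line[pathStart..<pathEnd], as: UTF8.self)

print(path)
```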

## Proposed solution

I'd like to expose `String.Type._fromCodeUnitSequence(_:input:)` as
public API:

init?<Input: CollectionType, Encoding: UnicodeCodecType where
Encoding.CodeUnit == Input.Generator.Element>(codeUnits input: Input,
encoding: Encoding.Type)

And, for consistency with
`String.Type.fromCStringRepairingIllFormedUTF8(_:)`,
exposing `String.Type._fromCodeUnitSequenceWithRepair(_:input:)`:

```swift
static func fromCodeUnitsWithRepair<Input: CollectionType, Encoding:
UnicodeCodecType where Encoding.CodeUnit ==
Input.Generator.Element>(input: Input, encoding: Encoding.Type)
```

These two functions seem like a good approach. The only alternatives I can think of are either to have a `withRepair: Bool` parameter to the initializer (possibly with a default value) or to make the initializer a type method instead for complete consistency with fromCString() and fromCStringRepairingIllFormedUTF8().

It would be nice to get some feedback from someone at Apple as to why fromCString() was implemented as a type method instead of a failable initializer. Presumably it was because there is both a repairing and a failable, non-repairing version.

## Detailed design

See [full
implementation](https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units).

This is a fairly straightforward renaming of the internal APIs.

The initializer, its labels, and their order were chosen to match other
non-cast
initializers in the stdlib. "Sequence" was removed, as it was a
misnomer.
"input" was kept as a generic name in order to allow for future
refinements.

The static initializer made the same changes, but was otherwise kept as
a
factory function due to its multiple return values.

`String.Type._fromWellFormedCodeUnitSequence(_:input:)` was kept as-is
for
internal use. I assume it wouldn't be good to expose publicly because,
for
lack of a better phrase, we only "trust" the stdlib to accurately know
the
wellformedness of their code units. Since it is a simple call through,
its
use could be elided throughout the stdlib.

## Impact on existing code

This is an additive change to the API.

## Alternatives considered

* A protocol-oriented API.

Some kind of `func decode<Encoding>(_:)` on `SequenceType`. It's not
really
clear this method would be related to string processing, and would
require
some kind of bounding (like `where Generator.Element:
UnsignedIntegerType`), but
that would be introducing a type bound that doesn't exist on

* Do nothing.

This seems suboptimal. For many use cases, `String` lacking this
constructor is
a limiting factor on performance for many kinds of pure-Swift
implementations.

And performance is extremely important in many file parsing scenarios because the size of the input files is unpredictable (and often large!).

* Make the `NSString` [bridge
faster](String initialization notes · GitHub).

After reading the bridge code, I don't really know why it's slower.
Maybe it's
a bug.

* Make `String.append(_:)`
[faster](String initialization notes · GitHub).

I don't completely understand the growth strategy of `_StringCore`, but
it doesn't seem to exhibit the documented amortized `O(1)`, even when
`reserveCapacity(_:)` is used. In the pre-proposal discussion, a user
noted that
it seems like `reserveCapacity` acts like a no-op.

Even if the performance problems here are fixed, relying on String.append() would still lead to more verbose code than the proposed direct initializer or factory function.

···

On Jan 11, 2016, at 1:56 PM, Zach Waldowski via swift-evolution <swift-evolution@swift.org> wrote:


Hi Zach,

We looked at the CString APIs as part of the API Naming Guidelines application effort.
You can see the results here: https://github.com/apple/swift/commit/f4aaece75e97379db6ba0a1fdb1da42c231a1c3b

The main idea is to turn static factories into initializers and make init(cString:) do ‘most probably the right thing’, i.e. repair UTF8 code units.

I haven’t looked at your proposal in detail, but I think that if we add a new String.decodeCString that accepts an UnsafeBufferPointer instead of an UnsafePointer (and does not have to call _swift_stdlib_strlen), that would solve the problem. Unless I’m missing something.
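A small sketch of the difference between the two shapes, using APIs available in current Swift toolchains. The pointer-based call is `String.decodeCString(_:as:repairingInvalidCodeUnits:)`; for the buffer-based side I am using `String(decoding:as:)` as a stand-in, since the buffer-accepting `decodeCString` suggested above is only an idea at this point.

```swift
let bytes: [UInt8] = [72, 0, 105]   // "H", NUL, "i"

// Pointer-based: the decoder scans for the NUL itself and stops there.
let viaCString = bytes.withUnsafeBufferPointer { buffer in
    String.decodeCString(buffer.baseAddress, as: UTF8.self,
                         repairingInvalidCodeUnits: true)
}

// Buffer-based: the length travels with the pointer, so nothing is scanned
// for a sentinel and the embedded NUL is kept as an ordinary scalar.
let viaBuffer = bytes.withUnsafeBufferPointer { buffer in
    String(decoding: buffer, as: UTF8.self)
}

print(viaCString?.result ?? "nil", viaBuffer.unicodeScalars.count)
```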

regards,
max

···

On Jan 11, 2016, at 1:56 PM, Zach Waldowski via swift-evolution <swift-evolution@swift.org> wrote:


Sorry, I didn't get it. Are you talking about:

var bytes : [UInt8] = [65, 66, 67, 68, 69, 70];

var s = String.init(bytes: bytes, encoding: NSUTF8StringEncoding)

?

···

On Sat, Jan 9, 2016 at 8:22 AM, T.J. Usiyan via swift-evolution < swift-evolution@swift.org> wrote:

I support this change as well.

TJ

On Fri, Jan 8, 2016 at 6:06 PM, Guillaume Lessard via swift-evolution < swift-evolution@swift.org> wrote:

Complete agreement from me. I would like to see a String constructor from
a Sequence of code units.
Also, why is String.fromCString() a factory function rather than a
fallible constructor?

Guillaume Lessard


--
Best Regards!

Yang Wu
--------------------------------------------------------
Location: Pudong, Shanghai, China.
EMail : pinxue@gmail.com
Website: http://www.time2change.mobi http://rockplayer.com
Twitter/Weibo : @pinxue
<http://www.pinxue.net>

It would be nice to get some feedback from someone at Apple as to why fromCString() was implemented as a type method instead of a failable initializer. Presumably it was because there is both a repairing and a failable, non-repairing version.

There probably were no failable initializers when it was first implemented. The other thing is `fromCStringRepairingIllFormedUTF8` returns a tuple, so cannot be an initializer.
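
To illustrate the tuple point with today's standard library: the Swift 2 method named above no longer exists, and `decodeCString(_:as:repairingInvalidCodeUnits:)` is used here as its closest current counterpart. The byte values are made up for the example.

```swift
// "A", an invalid UTF-8 byte, "B", then the NUL terminator.
let bytes: [UInt8] = [0x41, 0xFF, 0x42, 0x00]

// Repairing variant: reports both the string and whether repairs happened.
// Two results do not fit the shape of a plain initializer.
let repaired = String.decodeCString(bytes, as: UTF8.self,
                                    repairingInvalidCodeUnits: true)

// Non-repairing variant: a single result that may be absent, which is
// exactly what a failable initializer can express.
let strict = String.decodeCString(bytes, as: UTF8.self,
                                  repairingInvalidCodeUnits: false)

print(repaired?.result ?? "nil", repaired?.repairsMade ?? false)
print(strict == nil)
```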

···

On Jan 12, 2016, at 11:18 AM, Charles Kissinger via swift-evolution <swift-evolution@swift.org> wrote:

It would be nice to get some feedback from someone at Apple as to why fromCString() was implemented as a type method instead of a failable initializer. Presumably it was because there is both a repairing and a failable, non-repairing version.


Hi Zach,

We looked at the CString APIs as part of the API Naming Guidelines application effort.
You can see the results here: https://github.com/apple/swift/commit/f4aaece75e97379db6ba0a1fdb1da42c231a1c3b

The main idea is to turn static factories into initializers and make init(cString:) do ‘most probably the right thing’, i.e. repair UTF8 code units.

Haven’t looked at your proposal in detail, but I think that if we add a new String.decodeCString that accepts an UnsafeBufferPointer instead of an UnsafePointer (and does not have to call _swift_stdlib_strlen), that would solve the problem. Unless I’m missing something.

That would solve my particular problems anyway. Will a proposal still be required for this to happen?

-CK

···

On Jan 12, 2016, at 11:57 AM, Max Moiseev via swift-evolution <swift-evolution@swift.org> wrote:

regards,
max

On Jan 11, 2016, at 1:56 PM, Zach Waldowski via swift-evolution <swift-evolution@swift.org> wrote:

Given the initial positive response, I've taken a crack both at
implementation and converting the request to a proposal. The proposal
draft is located at:

  https://github.com/zwaldowski/swift-evolution/blob/string-from-code-units/proposals/0000-string-from-code-units.md

The code is located at:

  https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units

The proposal is reproduced below:

# Expose code unit initializers on String

* Proposal:
[SE-NNNN](https://github.com/apple/swift-evolution/blob/master/proposals/NNNN-string-from-code-units.md)
* Author: [Zachary Waldowski](https://github.com/zwaldowski)
* Status: **Awaiting review**
* Review manager: TBD

## Introduction

Going back and forth from Strings to their byte representations is an
important part of solving many problems, including object
serialization, binary file formats, wire/network interfaces, and
cryptography. Swift has such utilities, currently only exposed through
`String.Type.fromCString(_:)`.

See swift-evolution
[thread](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160104/005951.html).

## Motivation

In developing a parser, a coworker did the yeoman's work of benchmarking
Swift's Unicode types. He swore up and down that
`String.Type.fromCString(_:)`
([use](String initialization notes · GitHub))
was the fastest way he found. I, stubborn and noobish as I am, was
skeptical that a better way couldn't be wrought from Swift's
`UnicodeCodecType`s.

After reading through stdlib source and doing my own testing, this is no
wives'
tale. `fromCString` is essentially the only public-facing user of
`String.Type._fromCodeUnitSequence(_:input:)`, which serves the exact
role of
both efficient and safe initialization-by-buffer-copy.

Of course, `fromCString` isn't a silver bullet; it has to have a null
sentinel,
requiring a copy of the origin buffer if one needs to be added (as is
the
case with formats that specify the length up front, or unstructured
payloads
that use unescaped double quotes as the terminator). It also prevents
the string itself from containing the null character.
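
A minimal sketch of the sentinel problem, assuming a hypothetical wire format of one length byte followed by that many UTF-8 bytes. It uses the current standard library (`String(cString:)` and `String(decoding:as:)`) in place of the Swift 2 `fromCString`, so it shows the shape of the problem rather than the API of the time.

```swift
// Hypothetical wire format: one length byte, then that many UTF-8 bytes.
// The trailing bytes belong to whatever comes next in the stream.
let packet: [UInt8] = [5, 0x68, 0x65, 0x6C, 0x6C, 0x6F, 0x21, 0x21]
let length = Int(packet[0])
let field = packet[1 ..< 1 + length]

// C-string route: the field has no terminator, so it must be copied
// into a new buffer and a NUL appended before it can be read.
var terminated = Array(field)
terminated.append(0)
let viaCString = String(cString: terminated)

// Length-aware route: decode the slice directly, no sentinel required.
let viaLength = String(decoding: field, as: UTF8.self)

print(viaCString == viaLength)
```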

## Proposed solution

I'd like to expose `String.Type._fromCodeUnitSequence(_:input:)` as
public API:

```swift
init?<Input: CollectionType, Encoding: UnicodeCodecType where
Encoding.CodeUnit == Input.Generator.Element>(codeUnits input: Input,
encoding: Encoding.Type)
```

And, for consistency with
`String.Type.fromCStringRepairingIllFormedUTF8(_:)`,
exposing `String.Type._fromCodeUnitSequenceWithRepair(_:input:)`:

```swift
static func fromCodeUnitsWithRepair<Input: CollectionType, Encoding:
UnicodeCodecType where Encoding.CodeUnit ==
Input.Generator.Element>(input: Input, encoding: Encoding.Type)
```
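
As a rough illustration of how these two entry points might behave, here is a sketch in current Swift syntax. It is not the linked implementation: it is built on the public `transcode(_:from:to:stoppingOnError:into:)` function rather than the internal `_fromCodeUnitSequence`, and the `(String, hadError: Bool)` return type is an assumption, since the signature above does not spell one out.

```swift
extension String {
    /// Failable form: returns nil as soon as ill-formed input is found.
    init?<Input: Collection, Encoding: UnicodeCodec>(
        codeUnits input: Input, encoding: Encoding.Type
    ) where Input.Element == Encoding.CodeUnit {
        var bytes = [UInt8]()
        bytes.reserveCapacity(input.count)
        let hadError = transcode(
            input.makeIterator(), from: encoding, to: UTF8.self,
            stoppingOnError: true
        ) { bytes.append($0) }
        guard !hadError else { return nil }
        self = String(decoding: bytes, as: UTF8.self)
    }

    /// Repairing form: always produces a string and reports whether
    /// ill-formed input was seen. The return type is an assumption.
    static func fromCodeUnitsWithRepair<Input: Collection, Encoding: UnicodeCodec>(
        _ input: Input, encoding: Encoding.Type
    ) -> (String, hadError: Bool) where Input.Element == Encoding.CodeUnit {
        var bytes = [UInt8]()
        let hadError = transcode(
            input.makeIterator(), from: encoding, to: UTF8.self,
            stoppingOnError: false
        ) { bytes.append($0) }
        return (String(decoding: bytes, as: UTF8.self), hadError: hadError)
    }
}

let ok = String(codeUnits: [0x68, 0x69] as [UInt8], encoding: UTF8.self)
let bad = String(codeUnits: [0xD800] as [UInt16], encoding: UTF16.self)
let fixed = String.fromCodeUnitsWithRepair([0x41, 0xFF] as [UInt8], encoding: UTF8.self)
print(ok ?? "nil", bad ?? "nil", fixed.hadError)
```

The extra transcoding pass makes this sketch slower than the internal routine; it is only meant to show the calling convention.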

## Detailed design

See [full
implementation](https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units).

This is a fairly straightforward renaming of the internal APIs.

The initializer, its labels, and their order were chosen to match other
non-cast
initializers in the stdlib. "Sequence" was removed, as it was a
misnomer.
"input" was kept as a generic name in order to allow for future
refinements.

The static function received the same changes, but was otherwise kept as
a factory function due to its multiple return values.

`String.Type._fromWellFormedCodeUnitSequence(_:input:)` was kept as-is
for
internal use. I assume it wouldn't be good to expose publicly because,
for
lack of a better phrase, we only "trust" the stdlib to accurately know
the
wellformedness of their code units. Since it is a simple call through,
its
use could be elided throughout the stdlib.

## Impact on existing code

This is an additive change to the API.

## Alternatives considered

* A protocol-oriented API.

Some kind of `func decode<Encoding>(_:)` on `SequenceType`. It's not
really
clear this method would be related to string processing, and would
require
some kind of bounding (like `where Generator.Element:
UnsignedIntegerType`), but
that would be introducing a type bound that doesn't exist on

* Do nothing.

This seems suboptimal. For many kinds of pure-Swift implementations,
the lack of this initializer on `String` is a limiting factor on
performance.

* Make the `NSString` [bridge
faster](String initialization notes · GitHub).

After reading the bridge code, I don't really know why it's slower.
Maybe it's
a bug.

* Make `String.append(_:)`
[faster](String initialization notes · GitHub).

I don't completely understand the growth strategy of `_StringCore`, but
it doesn't seem to exhibit the documented amortized `O(1)`, even when
`reserveCapacity(_:)` is used. In the pre-proposal discussion, a user
noted that
it seems like `reserveCapacity` acts like a no-op.

----

Cheers,
Zachary Waldowski
zach@waldowski.me


Max,

Seems like a fantastic change, if indeed the move is made from
UnsafePointer to UnsafeBufferPointer! That still doesn't cover the case
where you'd be doing code-unit level transforms (i.e., for custom
encoding schemes in some formats, like the Unicode escapes in JSON), but
that can probably also be done at the String level after-the-fact.

Awesome change, though! It'd be a shame to have to wait until 3.0 for it
to land.

···

--
Zach Waldowski
zach@waldowski.me

On Tue, Jan 12, 2016, at 02:57 PM, Max Moiseev wrote:

Hi Zach,

We looked at the CString APIs as part of the API Naming Guidelines
application effort.
You can see the results here:
https://github.com/apple/swift/commit/f4aaece75e97379db6ba0a1fdb1da42c231a1c3b

The main idea is to turn static factories into initializers and make
init(cString:) do ‘most probably the right thing’, i.e. repair UTF8 code
units.

Haven’t looked at your proposal in detail, but I think that if we add a
new String.decodeCString that accepts an UnsafeBufferPointer instead of
an UnsafePointer (and does not have to call _swift_stdlib_strlen), that
would solve the problem. Unless I’m missing something.

regards,
max


Though Max's follow-up might call into question the need for the
proposal (in a perfect world I'd like to see this in 2.2), I've
addressed your comments. Thanks!

···

--
Zach Waldowski
zach@waldowski.me

On Tue, Jan 12, 2016, at 02:18 PM, Charles Kissinger wrote:

Zach,

Thanks very much for writing up this proposal! This will be a very
valuable addition to the standard library for some of us. My comments are
below:

> On Jan 11, 2016, at 1:56 PM, Zach Waldowski via swift-evolution <swift-evolution@swift.org> wrote:
>
> Given the initial positive response, I've taken a crack both at
> implementation and converting the request to a proposal. The proposal
> draft is located at:
>
> https://github.com/zwaldowski/swift-evolution/blob/string-from-code-units/proposals/0000-string-from-code-units.md
>
> The code is located at:
>
> https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units
>
> The proposal is reproduced below:
>
> # Expose code unit initializers on String
>
> * Proposal:
> [SE-NNNN](https://github.com/apple/swift-evolution/blob/master/proposals/NNNN-string-from-code-units.md)
> * Author: [Zachary Waldowski](https://github.com/zwaldowski)
> * Status: **Awaiting review**
> * Review manager: TBD
>
> ## Introduction
>
> Going back and forth from Strings to their byte representations is an
> important part of solving many problems, including object
> serialization, binary file formats,

binary *and* text file formats!

> wire/network interfaces, and
> cryptography. Swift has such utilities, currently only exposed through
> `String.Type.fromCString(_:)`.
>
> See swift-evolution
> [thread](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160104/005951.html).
>
> ## Motivation
>
> In developing a parser, a coworker did the yeoman's work of benchmarking
> Swift's Unicode types. He swore up and down that
> `String.Type.fromCString(_:)`
> ([use](String initialization notes · GitHub))
> was the fastest way he found. I, stubborn and noobish as I am, was
> skeptical that a better way couldn't be wrought from Swift's
> `UnicodeCodecType`s.
>
> After reading through stdlib source and doing my own testing, this is no
> wives'
> tale. `fromCString` is essentially the only public-facing user of
> `String.Type._fromCodeUnitSequence(_:input:)`, which serves the exact
> role of
> both efficient and safe initialization-by-buffer-copy.

It might be worth mentioning here in the Motivation section that
String.append(_: UnicodeScalar) is not a viable alternative in many cases
because it has much slower performance. (I know it is discussed below
under alternatives.)

>
> Of course, `fromCString` isn't a silver bullet; it has to have a null
> sentinel,
> requiring a copy of the origin buffer if one needs to be added (as is
> the
> case with formats that specify the length up front, or unstructured
> payloads
> that use unescaped double quotes as the terminator). It also prevents
> the string itself from containing the null character.

This also means that something as fundamental as parsing sub-strings out
of an NSData object requires copying to intermediate buffers or the use
of much slower character-by-character appends.

Another limitation is that `fromCString` only works with UTF8 (or ASCII)
encoding.

It is worth mentioning also that the implementation of fromCString()
involves a string length calculation (call to strlen()). In many cases
that length has already been calculated in the client code. The proposed
solution has the potential of being at least slightly faster because the
strlen call is not needed. Maybe this should go in the Proposed Solution
section.
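
On the encoding point, a short sketch of what decoding non-UTF-8 code units looks like with a length-aware entry point. It uses `String(decoding:as:)` from the current standard library as a stand-in for the proposed initializer, and the code unit values are made up.

```swift
// UTF-16 code units for "héllo". A C-string API cannot take these,
// because its input is a sequence of bytes ending in NUL.
let utf16Units: [UInt16] = [0x0068, 0x00E9, 0x006C, 0x006C, 0x006F]
let fromUTF16 = String(decoding: utf16Units, as: UTF16.self)

// The same text as UTF-32 code units.
let utf32Units: [UInt32] = [0x68, 0xE9, 0x6C, 0x6C, 0x6F]
let fromUTF32 = String(decoding: utf32Units, as: UTF32.self)

print(fromUTF16 == fromUTF32)
```

Because the collection carries its own length, no length scan is needed either.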

>
> ## Proposed solution
>
> I'd like to expose `String.Type._fromCodeUnitSequence(_:input:)` as
> public API:
>
> ```swift
> init?<Input: CollectionType, Encoding: UnicodeCodecType where
> Encoding.CodeUnit == Input.Generator.Element>(codeUnits input: Input,
> encoding: Encoding.Type)
> ```
>
> And, for consistency with
> `String.Type.fromCStringRepairingIllFormedUTF8(_:)`,
> exposing `String.Type._fromCodeUnitSequenceWithRepair(_:input:)`:
>
> ```swift
> static func fromCodeUnitsWithRepair<Input: CollectionType, Encoding:
> UnicodeCodecType where Encoding.CodeUnit ==
> Input.Generator.Element>(input: Input, encoding: Encoding.Type)
> ```
>

These two functions seem like a good approach. The only alternatives I
can think of are either to have a `withRepair: Bool` parameter to the
initializer (possibly with a default value) or to make the initializer a
type method instead for complete consistency with fromCString() and
fromCStringRepairingIllFormedUTF8().

It would be nice to get some feedback from someone at Apple as to why
fromCString() was implemented as a type method instead of a failable
initializer. Presumably it was because there is both a repairing and a
failable, non-repairing version.

> ## Detailed design
>
> See [full
> implementation](https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units).
>
> This is a fairly straightforward renaming of the internal APIs.
>
> The initializer, its labels, and their order were chosen to match other
> non-cast
> initializers in the stdlib. "Sequence" was removed, as it was a
> misnomer.
> "input" was kept as a generic name in order to allow for future
> refinements.
>
> The static initializer made the same changes, but was otherwise kept as
> a
> factory function due to its multiple return values.
>
> `String.Type._fromWellFormedCodeUnitSequence(_:input:)` was kept as-is
> for
> internal use. I assume it wouldn't be good to expose publicly because,
> for
> lack of a better phrase, we only "trust" the stdlib to accurately know
> the
> wellformedness of their code units. Since it is a simple call through,
> its
> use could be elided throughout the stdlib.
>
> ## Impact on existing code
>
> This is an additive change to the API.
>
> ## Alternatives considered
>
> * A protocol-oriented API.
>
> Some kind of `func decode<Encoding>(_:)` on `SequenceType`. It's not
> really
> clear this method would be related to string processing, and would
> require
> some kind of bounding (like `where Generator.Element:
> UnsignedIntegerType`), but
> that would be introducing a type bound that doesn't exist on
>
> * Do nothing.
>
> This seems suboptimal. For many use cases, `String` lacking this
> constructor is
> a limiting factor on performance for many kinds of pure-Swift
> implementations.

And performance is extremely important in many file parsing scenarios
because the size of the input files is unpredictable (and often large!).

> * Make the `NSString` [bridge
> faster](String initialization notes · GitHub).
>
> After reading the bridge code, I don't really know why it's slower.
> Maybe it's
> a bug.
>
> * Make `String.append(_:)`
> [faster](String initialization notes · GitHub).
>
> I don't completely understand the growth strategy of `_StringCore`, but
> it doesn't seem to exhibit the documented amortized `O(1)`, even when
> `reserveCapacity(_:)` is used. In the pre-proposal discussion, a user
> noted that
> it seems like `reserveCapacity` acts like a no-op.

Even if the performance problems here are fixed, relying on
String.append() would still lead to more verbose code than the proposed
direct initializer or factory function.


The String initializer that would actually be of most value to me would be:

String.init(utf8: UnsafePointer<UInt8>, length: Int)

as long as it is as fast (preferably faster) than String.fromCString() and doesn’t require bridging to NSString. A similar initializer is available for NSString, but as Zach indicated in his link, using it is slow.

Several other initializers, including the one you showed, would be useful, as long as they are performant.

—CK

···

On Jan 8, 2016, at 9:04 PM, 品雪 via swift-evolution <swift-evolution@swift.org> wrote:

Sorry, I didn't get it, are you talking about:

import Foundation

let bytes: [UInt8] = [65, 66, 67, 68, 69, 70]
let s = String(bytes: bytes, encoding: NSUTF8StringEncoding)

?

On Sat, Jan 9, 2016 at 8:22 AM, T.J. Usiyan via swift-evolution <swift-evolution@swift.org> wrote:
I support this change as well.

TJ

On Fri, Jan 8, 2016 at 6:06 PM, Guillaume Lessard via swift-evolution <swift-evolution@swift.org> wrote:
Complete agreement from me. I would like to see a String constructor from a Sequence of code units.
Also, why is String.fromCString() a factory function rather than a fallible constructor?

Guillaume Lessard

--
Best Regards!

Yang Wu
--------------------------------------------------------
Location: Pudong, Shanghai, China.
EMail : pinxue@gmail.com
Website: http://www.time2change.mobi http://rockplayer.com
Twitter/Weibo : @pinxue
http://www.pinxue.net/

I think it would be more appropriate to use `init(utf8:
UnsafeBufferPointer<UInt8>)`, which comprises both a base pointer and a
length.
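
For illustration, here is a minimal sketch of the "base pointer plus length" shape. It assumes a later toolchain where `String(decoding:as:)` and `UnsafeBufferPointer(rebasing:)` exist (both postdate this thread), and the length-prefixed `payload` layout is made up for the example.

```swift
// Hypothetical length-prefixed payload: one length byte, then that many
// UTF-8 bytes, followed by unrelated trailing data.
let payload: [UInt8] = [5, 0x68, 0x65, 0x6C, 0x6C, 0x6F, 0xFF, 0xFF]

let text: String = payload.withUnsafeBufferPointer { whole in
    let length = Int(whole[0])
    // Rebase the slice so we again hold a base pointer plus a count.
    // No NUL terminator is needed and the source bytes are not copied first.
    let body = UnsafeBufferPointer(rebasing: whole[1 ..< 1 + length])
    // String(decoding:as:) copies the code units into the new String and
    // repairs any ill-formed sequences it finds.
    return String(decoding: body, as: UTF8.self)
}

print(text)
```

The buffer pointer carries its own count, so nothing has to scan for a terminator the way `strlen` does.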

Jacob Bandes-Storch


On Jan 8, 2016, at 10:02 PM, Jacob Bandes-Storch <jtbandes@gmail.com> wrote:

> I think it would be more appropriate to use `init(utf8: UnsafeBufferPointer<UInt8>)`, which comprises both a base pointer and a length.
>
> Jacob Bandes-Storch

Yes. Better.


On Jan 12, 2016, at 12:08 PM, Max Moiseev via swift-evolution <swift-evolution@swift.org> wrote:

>> On Jan 12, 2016, at 11:18 AM, Charles Kissinger via swift-evolution <swift-evolution@swift.org> wrote:
>>
>> It would be nice to get some feedback from someone at Apple as to why fromCString() was implemented as a type method instead of a failable initializer. Presumably it was because there is both a repairing and a failable, non-repairing version.
>
> There probably were no failable initializers when it was first implemented. The other thing is `fromCStringRepairingIllFormedUTF8` returns a tuple, so cannot be an initializer.

Can the initializer take an inout parameter instead? Seems like it would be better to keep a consistent "initializer story."


Zach, Charles. I’ll try to reply to both of you in one shot.

As @gribozavr pointed out in a private conversation, `UnsafeBufferPointer` conforms to CollectionType, so we can generalize String.decodeCString to accept a CollectionType and constrain it precisely as you, Zach, did in your proposal.
(I remember there were some troubles with the fact that CChar is Int8 (signed) and UTF8.CodeUnit is UInt8, but that might not affect this new method).
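
A rough sketch of what "accept a collection, constrained on the code unit type" could look like in later Swift syntax. `decodeUTF8` is a hypothetical helper, not a stdlib API, and it leans on `String(decoding:as:)`, which postdates this thread. The signed-char case is handled by reinterpreting each `Int8` as a `UInt8`.

```swift
// Hypothetical helper: accepts any collection whose elements are UTF-8
// code units, whether it is an Array, an UnsafeBufferPointer, or a lazy view.
func decodeUTF8<C: Collection>(_ codeUnits: C) -> String
    where C.Element == UTF8.CodeUnit {
    return String(decoding: codeUnits, as: UTF8.self)
}

let unsignedBytes: [UInt8] = [0x53, 0x77, 0x69, 0x66, 0x74]
let fromArray = decodeUTF8(unsignedBytes)
let fromBuffer = unsignedBytes.withUnsafeBufferPointer { decodeUTF8($0) }

// C APIs often hand back signed chars. A lazy map reinterprets each Int8
// as UInt8 on the fly without building an intermediate array.
let signedBytes: [Int8] = [0x53, 0x77, 0x69, 0x66, 0x74]
let fromSigned = decodeUTF8(signedBytes.lazy.map { UInt8(bitPattern: $0) })
```

The same generic entry point serves all three callers, which is the appeal of constraining on the element type rather than on a concrete pointer type.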

I don’t quite understand what you mean by ‘custom code-unit level transforms’, but maybe having a CollectionType can address that.

As for the proposal: this does not have to wait until Swift 3. The change I pointed at was a side effect of revisiting all the APIs in stdlib. So if you guys feel strongly about this change (and I think you do, otherwise you wouldn’t go as far as writing a proposal document), you can take what’s in the swift-3-api-guidelines branch, implement the new method we’ve discussed, add some ‘deprecation’ magic to make it compatible with Swift 2.1, and run it through the evolution process.

max

···

On Jan 12, 2016, at 12:22 PM, Zach Waldowski <zach@waldowski.me> wrote:

Max,

Seems like a fantastic change, if indeed the move is made from
UnsafePointer to UnsafeBufferPointer! That still doesn't cover the case
where you'd be doing code-unit level transforms (e.g., for custom
encoding schemes in some formats, like the Unicode escapes in JSON), but
that can probably also be done at the String level after-the-fact.

Awesome change, though! It'd be a shame to have to wait until 3.0 for it
to land.

--
Zach Waldowski
zach@waldowski.me

On Tue, Jan 12, 2016, at 02:57 PM, Max Moiseev wrote:

Hi Zach,

We looked at the CString APIs as part of API Naming Guidelines
application effort.
You can see the results here:
revisiting CString related String extensions · apple/swift@f4aaece · GitHub

The main idea is to turn static factories into initializers and make
init(cString:) do ‘most probably the right thing’, i.e. repair UTF8 code
units.

Haven’t looked at your proposal in detail, but I think that if we add a
new String.decodeCString that accepts an UnsafeBufferPointer instead of
an UnsafePointer (and does not have to call _swift_stdlib_strlen), that
would solve the problem. Unless I’m missing something.

regards,
max

On Jan 11, 2016, at 1:56 PM, Zach Waldowski via swift-evolution <swift-evolution@swift.org> wrote:

Given the initial positive response, I've taken a crack both at
implementation and converting the request to a proposal. The proposal
draft is located at:

  https://github.com/zwaldowski/swift-evolution/blob/string-from-code-units/proposals/0000-string-from-code-units.md

The code is located at:

  https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units

The proposal is reproduced below:

# Expose code unit initializers on String

* Proposal:
[SE-NNNN](https://github.com/apple/swift-evolution/blob/master/proposals/NNNN-string-from-code-units.md)
* Author: [Zachary Waldowski](https://github.com/zwaldowski)
* Status: **Awaiting review**
* Review manager: TBD

## Introduction

Going back and forth from Strings to their byte representations is an
important part of solving many problems, including object
serialization, binary file formats, wire/network interfaces, and
cryptography. Swift has such utilities, currently only exposed through
`String.Type.fromCString(_:)`.

See swift-evolution
[thread](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160104/005951.html).

## Motivation

In developing a parser, a coworker did the yeoman's work of benchmarking
Swift's Unicode types. He swore up and down that
`String.Type.fromCString(_:)`
([use](String initialization notes · GitHub))
was the fastest way he found. I, stubborn and noobish as I am, was
skeptical that a better way couldn't be wrought from Swift's
`UnicodeCodecType`s.

After reading through stdlib source and doing my own testing, this is no
wives' tale. `fromCString` is essentially the only public-facing user of
`String.Type._fromCodeUnitSequence(_:input:)`, which serves the exact
role of both efficient and safe initialization-by-buffer-copy.

Of course, `fromCString` isn't a silver bullet; it has to have a null
sentinel, requiring a copy of the origin buffer if one needs to be added
(as is the case with formats that specify the length up front, or
unstructured payloads that use unescaped double quotes as the
terminator). It also prevents the string itself from containing the null
character.
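
To make the sentinel limitation concrete, here is a small sketch written against later Swift (`String(cString:)` and `String(decoding:as:)`, both of which postdate this proposal). The byte values are made up for the example.

```swift
// A payload whose length is known up front and which contains a NUL byte.
let bytes: [UInt8] = [0x61, 0x62, 0x00, 0x63]   // "ab", NUL, "c"

// C-string route: the payload has no terminator of its own, so the buffer
// must be copied and a NUL appended before it can be handed over.
var terminated = bytes
terminated.append(0)
let viaCString = terminated.withUnsafeBufferPointer { buffer in
    String(cString: buffer.baseAddress!)
}

// Length-based route: no copy to add a terminator, and the embedded NUL
// survives as an ordinary code unit.
let viaBuffer = String(decoding: bytes, as: UTF8.self)

print(viaCString.utf8.count, viaBuffer.utf8.count)
```

The C-string route stops at the first NUL it sees, so everything after the embedded NUL is silently dropped.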

## Proposed solution

I'd like to expose `String.Type._fromCodeUnitSequence(_:input:)` as
public API:

```swift
init?<Input: CollectionType, Encoding: UnicodeCodecType where
Encoding.CodeUnit == Input.Generator.Element>(codeUnits input: Input,
encoding: Encoding.Type)
```

And, for consistency with
`String.Type.fromCStringRepairingIllFormedUTF8(_:)`,
exposing `String.Type._fromCodeUnitSequenceWithRepair(_:input:)`:

```swift
static func fromCodeUnitsWithRepair<Input: CollectionType, Encoding:
UnicodeCodecType where Encoding.CodeUnit ==
Input.Generator.Element>(input: Input, encoding: Encoding.Type)
```

## Detailed design

See [full
implementation](https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units).

This is a fairly straightforward renaming of the internal APIs.

The initializer, its labels, and their order were chosen to match other
non-cast initializers in the stdlib. "Sequence" was removed, as it was a
misnomer. "input" was kept as a generic name in order to allow for
future refinements.

The static initializer made the same changes, but was otherwise kept as
a factory function due to its multiple return values.

`String.Type._fromWellFormedCodeUnitSequence(_:input:)` was kept as-is
for internal use. I assume it wouldn't be good to expose publicly
because, for lack of a better phrase, we only "trust" the stdlib to
accurately know the well-formedness of its code units. Since it is a
simple call through, its use could be elided throughout the stdlib.
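
As a rough illustration of the two API shapes described above (a failable initializer and a factory returning two values), here is a sketch in later Swift syntax. It is not the linked implementation: both members are hypothetical stand-ins, and they decode one scalar at a time through `UnicodeCodec.decode`, so they show the surface of the API and not the buffer-copy fast path the proposal is after.

```swift
extension String {
    /// Hypothetical stand-in for the proposed factory: always succeeds,
    /// replacing ill-formed input with U+FFFD and reporting whether it did.
    static func fromCodeUnitsWithRepair<Input: Collection, Encoding: UnicodeCodec>(
        _ input: Input, encoding: Encoding.Type
    ) -> (String, hadError: Bool) where Input.Element == Encoding.CodeUnit {
        let replacement: Unicode.Scalar = "\u{FFFD}"
        var decoder = Encoding()
        var iterator = input.makeIterator()
        var scalars = String.UnicodeScalarView()
        var hadError = false
        decoding: while true {
            switch decoder.decode(&iterator) {
            case .scalarValue(let scalar):
                scalars.append(scalar)
            case .error:
                hadError = true
                scalars.append(replacement)
            case .emptyInput:
                break decoding
            }
        }
        return (String(scalars), hadError)
    }

    /// Hypothetical stand-in for the proposed failable initializer:
    /// same decoding, but ill-formed input yields nil.
    init?<Input: Collection, Encoding: UnicodeCodec>(
        codeUnits input: Input, encoding: Encoding.Type
    ) where Input.Element == Encoding.CodeUnit {
        let (string, hadError) = String.fromCodeUnitsWithRepair(input, encoding: encoding)
        if hadError { return nil }
        self = string
    }
}

let good: [UInt8] = [0x6F, 0x6B]          // "ok"
let bad: [UInt8] = [0x6F, 0xFF, 0x6B]     // 0xFF is never valid UTF-8

let strict = String(codeUnits: good, encoding: UTF8.self)
let rejected = String(codeUnits: bad, encoding: UTF8.self)
let repaired = String.fromCodeUnitsWithRepair(bad, encoding: UTF8.self)
```

The factory keeps its tuple return because an initializer cannot also report whether repairs were made.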

## Impact on existing code

This is an additive change to the API.

## Alternatives considered

* A protocol-oriented API.

Some kind of `func decode<Encoding>(_:)` on `SequenceType`. It's not
really clear this method would be related to string processing, and
would require some kind of bounding (like `where Generator.Element:
UnsignedIntegerType`), but that would be introducing a type bound that
doesn't exist on

* Do nothing.

This seems suboptimal. `String` lacking this initializer is a limiting
factor on performance for many kinds of pure-Swift implementations.

* Make the `NSString` [bridge
faster](String initialization notes · GitHub).

After reading the bridge code, I don't really know why it's slower.
Maybe it's a bug.

* Make `String.append(_:)`
[faster](String initialization notes · GitHub).

I don't completely understand the growth strategy of `_StringCore`, but
it doesn't seem to exhibit the documented amortized `O(1)`, even when
`reserveCapacity(_:)` is used. In the pre-proposal discussion, a user
noted that it seems like `reserveCapacity` acts like a no-op.
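
For reference, the reserve-then-append workflow under discussion looks roughly like the sketch below, written in later Swift syntax. It makes no timing claims; the sample input is ASCII-only by assumption, since appending one scalar per byte is only correct for ASCII, and `String(decoding:as:)` is used purely as a cross-check.

```swift
let bytes: [UInt8] = Array("hello, world".utf8)   // ASCII-only sample input

// Reserve once up front, then append one scalar per byte.
var viaAppend = String()
viaAppend.reserveCapacity(bytes.count)
for byte in bytes {
    // Only correct for ASCII: each byte is treated as its own scalar.
    viaAppend.unicodeScalars.append(Unicode.Scalar(byte))
}

// Single-shot decode of the same bytes, for comparison.
let viaDecoding = String(decoding: bytes, as: UTF8.self)
```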

----

Cheers,
Zachary Waldowski
zach@waldowski.me


Hi Alex,

If you mean that we still need to have initializers for both cases, we do. It’s just that in one of them (the repairing one) we throw away the information about whether repairs were made, which a) we don’t care about in many cases and b) can still be obtained using String.decodeCString.

Having an inout parameter in an initializer will break the (I think) common use case, where you get a CString from some C API, and want to call some Swift API that accepts String. I would do it like `swiftApi(String(cString))`, with inout it gets weird.
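
A small sketch of the two call shapes, using the API names as they later shipped (`String(cString:)` and `String.decodeCString(_:as:repairingInvalidCodeUnits:)`). Here `swiftApi` and the `raw` bytes are hypothetical stand-ins for a Swift API and for a buffer returned by some C API.

```swift
// Hypothetical Swift API that only wants a String.
func swiftApi(_ text: String) -> Int {
    return text.unicodeScalars.count
}

// Stand-in for bytes from a C API: "Hi", one invalid UTF-8 byte, then NUL.
let raw: [UInt8] = [0x48, 0x69, 0xFF, 0x00]

// Common case: repair silently and pass the result straight through.
let count = raw.withUnsafeBufferPointer { buffer in
    swiftApi(String(cString: buffer.baseAddress!))
}

// When it matters whether repairs happened, ask the decoding primitive.
let decoded = raw.withUnsafeBufferPointer { buffer in
    String.decodeCString(buffer.baseAddress, as: UTF8.self,
                         repairingInvalidCodeUnits: true)
}

// With repair disabled, the same input is rejected outright.
let rejected = raw.withUnsafeBufferPointer { buffer in
    String.decodeCString(buffer.baseAddress, as: UTF8.self,
                         repairingInvalidCodeUnits: false)
}
```

The initializer stays usable inline as an expression, while the repair flag remains available to callers who want it.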

What do you think?

max

···

On Jan 12, 2016, at 1:00 PM, Alex Migicovsky <migi@apple.com> wrote:

>> On Jan 12, 2016, at 12:08 PM, Max Moiseev via swift-evolution <swift-evolution@swift.org> wrote:
>>
>>> On Jan 12, 2016, at 11:18 AM, Charles Kissinger via swift-evolution <swift-evolution@swift.org> wrote:
>>>
>>> It would be nice to get some feedback from someone at Apple as to why fromCString() was implemented as a type method instead of a failable initializer. Presumably it was because there is both a repairing and a failable, non-repairing version.
>>
>> There probably were no failable initializers when it was first implemented. The other thing is `fromCStringRepairingIllFormedUTF8` returns a tuple, so cannot be an initializer.
>
> Can the initializer take an inout parameter instead? Seems like it would be better to keep a consistent "initializer story."


Max -

Great! Looking through it again, big +1 in favor of a
`repairIllFormedSequences: true` being the normal path.

I'm trying now to suss out the full gamut of methods that are needed, so
I can adapt the proposal + stdlib while backporting the changes from
3.0.

I'm in favor of two inits + decodeCString, with the latter sort of
becoming a "primitive". Just figuring out the best permutations of
those…

I might've mis-parsed the meaning of "'deprecation' magic". What's the
best path forward in the near-term? Would `decodeCString` be the only
one that becomes generic? Or, phrased differently, should there still be
`UnsafePointer<CChar>`+`strlen` versions?

···

--
Zach Waldowski
zach@waldowski.me

>>> On Fri, Jan 8, 2016, at 03:21 PM, Zach Waldowski wrote:
>>> _______________________________________________
>>> swift-evolution mailing list
>>> swift-evolution@swift.org
>>> https://lists.swift.org/mailman/listinfo/swift-evolution
>>

I was trying to say that any tuple-returning factory method can be turned into an initializer with an inout param, e.g.

struct F {
     static func makeF() -> (F?, Int) { … }
}

can be made into:

struct F {
     init?(inout result: Int) { … }
}

I think you should still be able to call an initializer like that with your `swiftApi(String(cString))` example, right? It would just be `swiftApi(String(cString, foo: &otherTupleValue))`. I thought the proposed alternative would look more like `swiftApi(String.fromCString(cString).0)` (I’ve lost track at this point of what the exact API proposal is, sorry).

With this approach, every time you need to create a String you go through a String initializer; you don’t need to think about whether it’s a factory method or an initializer. That’s what I was trying to get at about keeping the “initializer story” consistent.
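
To make the two shapes concrete, here is a minimal sketch in current Swift syntax (where `inout` follows the parameter's colon, unlike the Swift 2 spelling in the snippet above). `Decoded`, `make(fromUTF8:)`, and the `hadErrors` label are hypothetical names for illustration only, not stdlib API, and the round-trip comparison is just one assumed way to detect that repairs happened.

```swift
// Hypothetical wrapper type; stands in for `F` in the snippet above.
struct Decoded {
    let text: String

    // Factory style: the extra information travels back in a tuple.
    static func make(fromUTF8 bytes: [UInt8]) -> (result: Decoded?, hadErrors: Bool) {
        var hadErrors = false
        let value = Decoded(utf8: bytes, hadErrors: &hadErrors)
        return (value, hadErrors)
    }

    // Initializer style: the second tuple element becomes an inout parameter.
    init?(utf8 bytes: [UInt8], hadErrors: inout Bool) {
        // Assumption for the sketch: empty input counts as failure.
        guard !bytes.isEmpty else { return nil }
        // Decoding repairs ill-formed sequences with U+FFFD.
        let repaired = String(decoding: bytes, as: UTF8.self)
        // If the re-encoded bytes differ from the input, something was repaired.
        hadErrors = Array(repaired.utf8) != bytes
        text = repaired
    }
}

// Call sites for both shapes.
var flag = false
let viaInit = Decoded(utf8: [0x68, 0x69], hadErrors: &flag)
let viaFactory = Decoded.make(fromUTF8: [0x61, 0xFF])
```

The initializer form keeps every construction path spelled `Decoded(...)`, at the cost of requiring a mutable variable at the call site.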

- Alex

On Jan 12, 2016, at 5:01 PM, Max Moiseev <moiseev@apple.com> wrote:

Hi Alex,

If you mean that we still need to have initializers for both cases, we do. It’s just that in one of them (the repairing one) we throw away the information about whether repairs were made, which a) we don’t care about in many cases and b) is still available through String.decodeCString.

Having an inout parameter in an initializer will break the (I think) common use case, where you get a CString from some C API, and want to call some Swift API that accepts String. I would do it like `swiftApi(String(cString))`, with inout it gets weird.
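
As a rough illustration of the two call sites being contrasted, the sketch below uses `String(cString:)` and `String.decodeCString(_:as:repairingInvalidSequences:)` as they exist in the current standard library; those spellings may differ from what was under discussion at this point in the thread. `swiftApi` is a hypothetical stand-in for "some Swift API that accepts String", and the byte array is made-up sample data.

```swift
// Hypothetical Swift API that only wants a String.
func swiftApi(_ s: String) -> Int {
    return s.utf8.count
}

// NUL-terminated bytes as a C API might hand them back.
// 0xFF is never valid in UTF-8, so decoding has to repair it.
let bytes: [UInt8] = [0x68, 0x69, 0xFF, 0x00]

// Repairing initializer: composes directly into the call,
// but says nothing about whether repairs were made.
let count = bytes.withUnsafeBufferPointer { buf -> Int in
    swiftApi(String(cString: buf.baseAddress!))
}

// Static method: returns the string together with a `repairsMade` flag
// for callers that do care.
let decoded = bytes.withUnsafeBufferPointer { buf in
    String.decodeCString(buf.baseAddress, as: UTF8.self,
                         repairingInvalidSequences: true)
}
```

The first form nests cleanly inside another call; the second needs a binding to pull the tuple apart, which is the trade-off the message describes.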

What do you think?

max

On Jan 12, 2016, at 1:00 PM, Alex Migicovsky <migi@apple.com> wrote:

On Jan 12, 2016, at 12:08 PM, Max Moiseev via swift-evolution <swift-evolution@swift.org> wrote:

There probably were no failable initializers when it was first implemented. The other thing is `fromCStringRepairingIllFormedUTF8` returns a tuple, so cannot be an initializer.

Can the initializer take an inout parameter instead? Seems like it would be better to keep a consistent "initializer story."

On Jan 12, 2016, at 11:18 AM, Charles Kissinger via swift-evolution <swift-evolution@swift.org> wrote:

It would be nice to get some feedback from someone at Apple as to why fromCString() was implemented as a type method instead of a failable initializer. Presumably it was because there is both a repairing and a failable, non-repairing version.

## Detailed design

See [full
implementation](https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units).

This is a fairly straightforward renaming of the internal APIs.

The initializer, its labels, and their order were chosen to match other
non-cast initializers in the stdlib. "Sequence" was removed, as it was a
misnomer. "input" was kept as a generic name in order to allow for
future refinements.

The static initializer made the same changes, but was otherwise kept as
a factory function due to its multiple return values.

`String.Type._fromWellFormedCodeUnitSequence(_:input:)` was kept as-is
for internal use. I assume it wouldn't be good to expose publicly
because, for lack of a better phrase, we only "trust" the stdlib to
accurately know the wellformedness of their code units. Since it is a
simple call through, its use could be elided throughout the stdlib.
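
For readers who want a feel for the shape described here, the following is a sketch in current Swift, not the implementation linked above. The labels `codeUnits:encoding:` follow the spitballed spelling from the original message, and `decode(codeUnits:encoding:)` is a hypothetical name for the factory; neither is claimed to match the proposal's final names. It is built on the standard library's `transcode(_:from:to:stoppingOnError:into:)`, which is an assumption about how one might write this outside the stdlib.

```swift
extension String {
    // Hypothetical failable initializer: nil if the input is ill-formed.
    init?<C: Collection, Codec: Unicode.Encoding>(
        codeUnits: C, encoding: Codec.Type
    ) where C.Element == Codec.CodeUnit {
        var utf8: [UInt8] = []
        utf8.reserveCapacity(codeUnits.count)
        // `transcode` returns true when it hit an encoding error.
        let hadError = transcode(codeUnits.makeIterator(),
                                 from: encoding, to: UTF8.self,
                                 stoppingOnError: true) { utf8.append($0) }
        if hadError { return nil }
        self = String(decoding: utf8, as: UTF8.self)
    }

    // Hypothetical repairing factory: stays a static function because it
    // reports two values (the string and whether repairs were made).
    static func decode<C: Collection, Codec: Unicode.Encoding>(
        codeUnits: C, encoding: Codec.Type
    ) -> (result: String, repairsMade: Bool) where C.Element == Codec.CodeUnit {
        var utf8: [UInt8] = []
        // With stoppingOnError false, ill-formed input becomes U+FFFD.
        let hadError = transcode(codeUnits.makeIterator(),
                                 from: encoding, to: UTF8.self,
                                 stoppingOnError: false) { utf8.append($0) }
        return (String(decoding: utf8, as: UTF8.self), hadError)
    }
}

// No NUL terminator and no strlen: the collection carries its own length.
let strict = String(codeUnits: [0x68, 0x69] as [UInt8], encoding: UTF8.self)
let lenient = String.decode(codeUnits: [0x61, 0xFF] as [UInt8], encoding: UTF8.self)
```

This sketch copies through an intermediate array, so it says nothing about the performance of the real internal entry points.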

## Impact on existing code

This is an additive change to the API.

## Alternatives considered

* A protocol-oriented API.

Some kind of `func decode<Encoding>(_:)` on `SequenceType`. It's not
really clear this method would be related to string processing, and
would require some kind of bounding (like
`where Generator.Element: UnsignedIntegerType`), but that would be
introducing a type bound that doesn't exist on

* Do nothing.

This seems suboptimal. `String` lacking this constructor is a limiting
factor on performance for many kinds of pure-Swift implementations.
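
To illustrate the first alternative in the list above, this is roughly what a protocol-oriented spelling could look like in current Swift (`Sequence` and `UnsignedInteger` being today's names for `SequenceType` and `UnsignedIntegerType`). The method name `decode(_:)` comes from the bullet; the body, built on `transcode(_:from:to:stoppingOnError:into:)`, is an assumption made for the sake of a runnable sketch.

```swift
// Hypothetical extension: decoding hangs off the sequence, not off String.
extension Sequence where Element: UnsignedInteger {
    func decode<Codec: Unicode.Encoding>(_ encoding: Codec.Type) -> String?
        where Codec.CodeUnit == Element
    {
        var utf8: [UInt8] = []
        // `transcode` returns true when the input was ill-formed.
        let hadError = transcode(makeIterator(),
                                 from: encoding, to: UTF8.self,
                                 stoppingOnError: true) { utf8.append($0) }
        return hadError ? nil : String(decoding: utf8, as: UTF8.self)
    }
}

let greeting = ([0x68, 0x69] as [UInt8]).decode(UTF8.self)
```

Note how `decode` then shows up on every sequence of unsigned integers, which is the discoverability concern the bullet raises.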
