Faster/lower-level external String initialization

zwaldowski · January 11, 2016, 9:56pm

Given the initial positive response, I've taken a crack both at
implementation and converting the request to a proposal. The proposal
draft is located at:

https://github.com/zwaldowski/swift-evolution/blob/string-from-code-units/proposals/0000-string-from-code-units.md

The code is located at:

https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units

The proposal is reproduced below:

# Expose code unit initializers on String

* Proposal: [SE-NNNN](https://github.com/apple/swift-evolution/blob/master/proposals/NNNN-string-from-code-units.md)
* Author: [Zachary Waldowski](https://github.com/zwaldowski)
* Status: **Awaiting review**
* Review manager: TBD

## Introduction

Going back and forth from Strings to their byte representations is an
important part of solving many problems, including object
serialization, binary file formats, wire/network interfaces, and
cryptography. Swift has such utilities, currently only exposed through
`String.Type.fromCString(_:)`.

See swift-evolution
[thread](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160104/005951.html).

## Motivation

In developing a parser, a coworker did the yeoman's work of benchmarking
Swift's Unicode types. He swore up and down that
`String.Type.fromCString(_:)`
([use](String initialization notes · GitHub))
was the fastest way he found. I, stubborn and noobish as I am, was
skeptical that a better way couldn't be wrought from Swift's
`UnicodeCodecType`s.

After reading through stdlib source and doing my own testing, I found
that this is no old wives' tale. `fromCString` is essentially the only
public-facing user of `String.Type._fromCodeUnitSequence(_:input:)`,
which serves the exact role of both efficient and safe
initialization-by-buffer-copy.

Of course, `fromCString` isn't a silver bullet. It requires a null
sentinel, which forces a copy of the origin buffer whenever one has to
be added (as is the case with formats that specify the length up front,
or unstructured payloads that use unescaped double quotes as the
terminator). It also prevents the string itself from containing the
null character.
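
To make both limitations concrete, here is a minimal sketch. It is
written against current Swift, where `String(cString:)` plays the role
that `String.fromCString(_:)` played when this was posted; the
length-prefixed `payload` layout is a made-up example, not a real format.

```swift
// Hypothetical wire format: one length byte, then that many UTF-8 bytes,
// followed by unrelated data. There is no terminator in the payload.
let payload: [UInt8] = [5, 0x68, 0x65, 0x6C, 0x6C, 0x6F, 0xFF, 0xFF]
let length = Int(payload[0])
let body = payload[1..<(1 + length)]

// Limitation 1: the C-string route needs a NUL sentinel, so the bytes
// must be copied into a new buffer just to append it.
var terminated = Array(body)
terminated.append(0)
let viaCString = terminated.withUnsafeBufferPointer {
    String(cString: $0.baseAddress!)
}

// Limitation 2: decoding stops at the first NUL, so a string that
// legitimately contains U+0000 is silently cut short.
let withEmbeddedNul: [UInt8] = [0x61, 0x00, 0x62, 0x00]
let truncated = withEmbeddedNul.withUnsafeBufferPointer {
    String(cString: $0.baseAddress!)
}

print(viaCString)       // the five payload bytes decoded as text
print(truncated.count)  // only the part before the embedded NUL survives
```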

## Proposed solution

I'd like to expose `String.Type._fromCodeUnitSequence(_:input:)` as
public API:

    init?<Input: CollectionType, Encoding: UnicodeCodecType where
        Encoding.CodeUnit == Input.Generator.Element>(codeUnits input: Input,
        encoding: Encoding.Type)

And, for consistency with
`String.Type.fromCStringRepairingIllFormedUTF8(_:)`,
exposing `String.Type._fromCodeUnitSequenceWithRepair(_:input:)`:

```swift
static func fromCodeUnitsWithRepair<Input: CollectionType, Encoding:
    UnicodeCodecType where Encoding.CodeUnit ==
    Input.Generator.Element>(input: Input, encoding: Encoding.Type)
```

## Detailed design

See [full
implementation](https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units).

This is a fairly straightforward renaming of the internal APIs.

The initializer, its labels, and their order were chosen to match other
non-cast initializers in the stdlib. "Sequence" was removed, as it was a
misnomer. "input" was kept as a generic name in order to allow for
future refinements.

The static initializer made the same changes, but was otherwise kept as
a factory function due to its multiple return values.
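
For readers who want to try the shape of these two entry points, here is
a rough stand-in written against current Swift (`Collection`,
`UnicodeCodec`) rather than the Swift 2 names above. It is not the
stdlib implementation from the linked branch: it decodes scalar by
scalar with the codec's `decode(_:)`, the generic parameter is called
`Codec` to avoid clashing with Foundation's `String.Encoding`, and the
`(String, hadError: Bool)` return type is my assumption about what
"multiple return values" looks like.

```swift
extension String {
    /// Hypothetical stand-in for the proposed failable initializer.
    /// Returns nil as soon as the codec reports ill-formed input.
    init?<Input: Collection, Codec: UnicodeCodec>(
        codeUnits input: Input, encoding: Codec.Type
    ) where Codec.CodeUnit == Input.Element {
        var decoder = Codec()
        var iterator = input.makeIterator()
        var scalars = String.UnicodeScalarView()
        decoding: while true {
            switch decoder.decode(&iterator) {
            case .scalarValue(let scalar): scalars.append(scalar)
            case .emptyInput: break decoding
            case .error: return nil
            }
        }
        self.init(scalars)
    }

    /// Hypothetical stand-in for the repairing factory function.
    /// Ill-formed sequences become U+FFFD and are reported via `hadError`.
    static func fromCodeUnitsWithRepair<Input: Collection, Codec: UnicodeCodec>(
        input: Input, encoding: Codec.Type
    ) -> (String, hadError: Bool) where Codec.CodeUnit == Input.Element {
        var decoder = Codec()
        var iterator = input.makeIterator()
        var scalars = String.UnicodeScalarView()
        var hadError = false
        decoding: while true {
            switch decoder.decode(&iterator) {
            case .scalarValue(let scalar): scalars.append(scalar)
            case .emptyInput: break decoding
            case .error:
                scalars.append("\u{FFFD}")
                hadError = true
            }
        }
        return (String(scalars), hadError: hadError)
    }
}

// No sentinel is needed, and an embedded NUL is just another scalar.
let exact = String(codeUnits: [0x61, 0x00, 0x62] as [UInt8], encoding: UTF8.self)
let rejected = String(codeUnits: [0xFF] as [UInt8], encoding: UTF8.self)
let (repaired, hadError) = String.fromCodeUnitsWithRepair(
    input: [0x61, 0xFF, 0x62] as [UInt8], encoding: UTF8.self)
```

The sketch only illustrates the call sites and the failable versus
repairing split; it says nothing about the buffer-copy performance that
motivates exposing the internal functions.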

`String.Type._fromWellFormedCodeUnitSequence(_:input:)` was kept as-is
for internal use. I assume it wouldn't be good to expose publicly
because, for lack of a better phrase, we only "trust" the stdlib to
accurately know the well-formedness of its code units. Since it is a
simple call-through, its use could be elided throughout the stdlib.

## Impact on existing code

This is an additive change to the API.

## Alternatives considered

* A protocol-oriented API.

Some kind of `func decode<Encoding>(_:)` on `SequenceType`. It's not
really clear this method would be related to string processing, and it
would require some kind of bounding (like
`where Generator.Element: UnsignedIntegerType`), but that would be
introducing a type bound that doesn't exist on

* Do nothing.

This seems suboptimal. `String` lacking this constructor is a limiting
factor on performance for many kinds of pure-Swift implementations.

* Make the `NSString` [bridge
faster](String initialization notes · GitHub).

After reading the bridge code, I don't really know why it's slower.
Maybe it's a bug.

* Make `String.append(_:)`
[faster](String initialization notes · GitHub).

I don't completely understand the growth strategy of `_StringCore`, but
it doesn't seem to exhibit the documented amortized `O(1)`, even when
`reserveCapacity(_:)` is used. In the pre-proposal discussion, a user
noted that it seems like `reserveCapacity` acts like a no-op.
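
For reference, the append-based workflow being compared looks roughly
like the sketch below. It is written against current Swift, the helper
name `stringByAppending(utf8:)` is made up for illustration, and it
makes no claim about how this performs on any particular stdlib version.

```swift
/// Hypothetical helper: reserve capacity up front, then append each
/// decoded scalar one at a time. Returns nil on ill-formed UTF-8.
func stringByAppending(utf8 bytes: [UInt8]) -> String? {
    var result = String()
    result.reserveCapacity(bytes.count)
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar): result.unicodeScalars.append(scalar)
        case .emptyInput: return result
        case .error: return nil
        }
    }
}

let appended = stringByAppending(utf8: [0x68, 0x65, 0x6C, 0x6C, 0x6F])
```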


----

Cheers,
Zachary Waldowski
zach@waldowski.me

On Fri, Jan 8, 2016, at 03:21 PM, Zach Waldowski wrote:

Going back and forth from Strings to their byte representations is an
important part of solving many problems, including object
serialization, binary file formats, wire/network interfaces, and
cryptography.

In developing such a parser, a coworker did the yeoman's work of
benchmarking Swift's Unicode types. He swore up and down that
String.Type.fromCString(_:) [0] was the fastest way he found. I,
stubborn and noobish as I am, was skeptical that a better way couldn't
be wrought from Swift's UnicodeCodecTypes.

After reading through stdlib source and doing my own testing, this is no
wives' tale. fromCString [1] is essentially the only public user of
String.Type._fromCodeUnitSequence(_:input:), which serves the exact role
of both efficient and safe initialization-by-buffer-copy.

Of course, fromCString isn't a silver bullet; it has to have a null
sentinel, requiring a copy of the origin buffer if one needs to be added
(as is the case with formats that specify the length up front, or
unstructured payloads that use unescaped double quotes as the
terminator). It also prevents the string itself from containing the null
character.

I'd like to see _fromCodeUnitSequence [2] become public API as (just
spitballing here) String.init?<Collection, Codec>(codeUnits:encoding:).
If that can't happen, an alternative to fromCString that doesn't use
strlen would be nice, and we can just eat the performance hit on other
code unit sequences.

I can't really think of a reason why it's not exposed yet, so I'm led to
believe I'm just missing something major, and not that a reason doesn't
exist. ;-)

There's also a discussion to be had about whether new API is needed at
all. Try as I might, I can't seem to get the
reserveCapacity/append(UnicodeScalar) workflow to have anything close to
the same speed. [3] Profiling indicates that I keep hitting
_StringBuffer.grow. I don't know if that means the buffer isn't uniquely
referenced, or it's a bug, or what, but it's consistently slower than
creating an Array of the bytes and performing fromCString on it. Similar
story with crossing the NSString bridge, which is even stranger. [4]

Anyway, I wanted to stir up discussion, see if I'm way off base and/or
whether this can be turned into a proposal.

[0]: String initialization notes · GitHub
[1]: https://github.com/apple/swift/blob/master/stdlib/public/core/CString.swift#L18-L31
[2]: https://github.com/apple/swift/blob/master/stdlib/public/core/String.swift#L134-L150
[3]: String initialization notes · GitHub
[4]: String initialization notes · GitHub

Cheers,
Zachary Waldowski
zach@waldowski.me