String initializers with encoding validation
- Proposal: SE-NNNN String initializers with encoding validation
- Author: Guillaume Lessard
- Review Manager: TBD
- Status: Pitch
- Bugs: rdar://99276048, rdar://99832858
- Implementation: (pending)
Introduction
We propose adding new String failable initializers that validate encoded input, and return nil when the input contains any invalid elements.
Motivation
The String type guarantees that it represents well-formed Unicode text. When data representing text is received from a file, the network, or some other source, it may be relevant to store it in a String, but that data must be validated first. String already provides a way to transform data to valid Unicode by repairing invalid elements, but such a transformation is often not desirable, especially when dealing with untrusted sources. For example, a JSON decoder cannot transform its input; it must fail if a span representing text contains any invalid UTF-8.
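To make the repairing behaviour concrete, here is a small sketch using the existing String(decoding:as:) initializer. The byte values are chosen only for illustration.

```swift
// 0xFF can never appear in well-formed UTF-8.
let input: [UInt8] = [0x61, 0xFF, 0x62]  // "a", invalid byte, "b"

// The existing initializer never fails: it substitutes U+FFFD
// (the replacement character) for the invalid byte.
let repaired = String(decoding: input, as: UTF8.self)

// The stored text no longer round-trips to the bytes that were received.
let roundTrips = repaired.utf8.elementsEqual(input)

print(repaired.unicodeScalars.map { String($0.value, radix: 16) })
// prints ["61", "fffd", "62"]
print(roundTrips)
// prints false
```

A caller that needs to reject such input has no direct way to learn from this initializer that a substitution took place.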
This functionality has not been available directly from the standard library. It is possible to compose it using existing public API, but only at the cost of extra memory copies and allocations. The standard library is uniquely positioned to implement this functionality in a performant way.
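As one illustration of such a composition (a sketch, not necessarily the approach the author has in mind), validation can be built today by decoding with repair and then comparing the result against the original bytes. The function name validatedUTF8String is hypothetical.

```swift
/// Hypothetical helper: returns a String only if `bytes` is well-formed UTF-8.
/// It works because the repairing initializer leaves valid input untouched
/// and always alters invalid input.
func validatedUTF8String(_ bytes: [UInt8]) -> String? {
    // First pass: decode with repair, which allocates and copies.
    let repaired = String(decoding: bytes, as: UTF8.self)
    // Second pass: compare the stored code units with the input.
    return repaired.utf8.elementsEqual(bytes) ? repaired : nil
}

let valid = validatedUTF8String([0x63, 0x61, 0x66, 0xC3, 0xA9])  // "café"
let truncated = validatedUTF8String([0x63, 0xC3])  // lead byte without continuation
```

Here valid holds a string and truncated is nil. The sketch walks the input twice and may build a string only to discard it, which is the kind of overhead the paragraph above refers to.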
Proposed Solution
We will add a new String initializer that can fail, returning nil, when its input is found to be invalid according to the encoding represented by a type parameter that conforms to Unicode.Encoding.
extension String {
  public init?<Encoding: Unicode.Encoding>(
    validating codeUnits: some Sequence<Encoding.CodeUnit>, as: Encoding.Type
  )
}
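The intended semantics can be approximated with existing API. The sketch below is a stand-in written as a free function, not the proposed implementation: the name makeValidatedString is hypothetical, and it relies on the standard library's transcode(_:from:to:stoppingOnError:into:) function, which returns true when it encounters an encoding error.

```swift
/// Hypothetical stand-in for the proposed initializer, generic over the
/// input encoding. Returns nil as soon as an invalid element is found.
func makeValidatedString<Encoding: Unicode.Encoding, Input: Sequence>(
    _ codeUnits: Input, as encoding: Encoding.Type
) -> String? where Input.Element == Encoding.CodeUnit {
    var utf8: [UInt8] = []
    // Transcode to UTF-8, stopping at the first encoding error.
    let hadError = transcode(
        codeUnits.makeIterator(), from: encoding, to: UTF8.self,
        stoppingOnError: true, into: { utf8.append($0) })
    // `utf8` is known to be valid here, so this decode repairs nothing.
    return hadError ? nil : String(decoding: utf8, as: UTF8.self)
}

let asciiBytes: [UInt8] = [0x68, 0x69]           // "hi"
let loneSurrogate: [UInt16] = [0x0068, 0xD800]   // "h" then an unpaired surrogate
let emoji: [UInt32] = [0x1F600]

let fromUTF8 = makeValidatedString(asciiBytes, as: UTF8.self)
let fromUTF16 = makeValidatedString(loneSurrogate, as: UTF16.self)
let fromUTF32 = makeValidatedString(emoji, as: UTF32.self)
```

In this sketch fromUTF16 is nil because an unpaired surrogate is not valid UTF-16, while the other two calls produce strings. The intermediate array is an extra allocation that a standard library implementation would be in a position to avoid.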
For convenience and discoverability, we will also provide initializers that specify the input encoding as part of an argument label:
extension String {
  public init?(validatingFromUTF8 codeUnits: some Sequence<UTF8.CodeUnit>)
  public init?(validatingFromUTF16 codeUnits: some Sequence<UTF16.CodeUnit>)
  public init?(validatingFromUTF32 codeUnits: some Sequence<UTF32.CodeUnit>)
}
These will construct a new String, returning nil when their input is found invalid according to the encoding specified by the label.
When dealing with data obtained from C, it is frequently the case that UTF-8 data is represented by CChar rather than UInt8. We will provide a convenience initializer for this use case, noting that it typically involves contiguous memory, and as such is well-served by explicitly using an abstraction for contiguous memory (UnsafeBufferPointer<CChar>):
extension String {
  public init?(validatingFromUTF8 codeUnits: UnsafeBufferPointer<CChar>)
}
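For comparison, a caller holding a buffer of CChar today has to reinterpret it as UInt8 before validating. The following sketch shows one way to do so with existing API; the function name validatedString(fromCChars:) is hypothetical, and the decode-then-compare check is an illustrative substitute for real validation.

```swift
/// Hypothetical helper: validates a contiguous buffer of C characters as UTF-8.
func validatedString(fromCChars buffer: UnsafeBufferPointer<CChar>) -> String? {
    // CChar has the same size and layout as UInt8, so the buffer can be
    // viewed as UInt8 without copying.
    return buffer.withMemoryRebound(to: UInt8.self) {
        (bytes: UnsafeBufferPointer<UInt8>) -> String? in
        // Decode with repair, then reject the result if anything was repaired.
        let repaired = String(decoding: bytes, as: UTF8.self)
        return repaired.utf8.elementsEqual(bytes) ? repaired : nil
    }
}

let goodChars: [CChar] = [104, 105]  // "hi"
// The byte 0xFF, expressed so that it compiles whether CChar is Int8 or UInt8.
let badChars: [CChar] = [104, CChar(truncatingIfNeeded: 0xFF)]

let good = goodChars.withUnsafeBufferPointer { validatedString(fromCChars: $0) }
let bad = badChars.withUnsafeBufferPointer { validatedString(fromCChars: $0) }
```

Here good holds "hi" and bad is nil. No null terminator is involved: the buffer's count determines how many code units are examined.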
String already features a validating initializer for UTF-8 input. It is intended for C interoperability, but its argument label does not convey the expectation that its input is a null-terminated C string. We propose to rename it in order to clarify this:
extension String {
  public init?(validatingCString nullTerminatedUTF8: UnsafePointer<CChar>)

  @available(Swift 5.XLIX, deprecated, renamed: "String.init(validatingCString:)")
  public init?(validatingUTF8 cString: UnsafePointer<CChar>)
}
Detailed Design
Please see the gist for details.
Source Compatibility
This proposal is strictly additive.
ABI Compatibility
This proposal adds new functions to the ABI.
Implications on adoption
This feature requires a new version of the standard library.
Alternatives considered
The validatingUTF8 argument label
The argument label validatingUTF8 seems like it would have been preferable to validatingFromUTF8, but using the former would have been source-breaking. The C string validation initializer takes an UnsafePointer<UInt8>, but it can also be called with an [UInt8] via implicit pointer conversion. Any use site that passes an [UInt8] to the C string validation initializer would have changed behaviour upon recompilation, going from considering a null character (\0) as the termination of the C string to considering it as a valid character.
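The difference between the two interpretations of a null character can be observed with existing API. This sketch contrasts the existing C string initializer with String(decoding:as:), which treats its input as a counted sequence of code units; the sample bytes are illustrative, and newer toolchains may emit a deprecation warning for the validatingUTF8 label.

```swift
let chars: [CChar] = [97, 0, 98, 0]  // "a", NUL, "b", NUL

// C string semantics: reading stops at the first null character.
let asCString: String? = chars.withUnsafeBufferPointer { buffer in
    String(validatingUTF8: buffer.baseAddress!)
}

// Sequence semantics: every element, including NUL, is a code unit.
let codeUnits = chars.map { UInt8(truncatingIfNeeded: $0) }
let asCodeUnits = String(decoding: codeUnits, as: UTF8.self)

print(asCString?.unicodeScalars.count ?? -1)
// prints 1
print(asCodeUnits.unicodeScalars.count)
// prints 4
```

The same bytes produce a one-scalar string under C string semantics and a four-scalar string under sequence semantics, which is the silent behaviour change described above.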
Have the CChar-validating function take a parameter of type some Sequence<CChar>
This would produce a compile-time ambiguity on platforms where CChar is typealiased to UInt8 rather than Int8. Using UnsafeBufferPointer<CChar> as the parameter type will avoid such a compile-time ambiguity.
Acknowledgements
Thanks to Michael Ilseman, Tina Liu and Quinn Quinn for discussions about input validation issues.
SE-0027 by Zachary Waldowski was reviewed in February 2016, covering similar ground. It was rejected at the time because the design of String had not been finalized. The name String.init(validatingCString:) was included in that proposal. Lily Ballard later pitched a renaming of String.init(validatingUTF8:), citing consistency with other String API involving C strings.