[Review] SE-0027 Expose code unit initializers on String

Hi Zach,

On your advice, I went back and read the sections of the Swift book relating to Strings and Characters. While it is easy to see that a String is a "linked list of text-ish things" or a "collection of Characters", I do not see anything that encourages a developer to treat a String as a "bag of bytes". I would not interpret the section on "Unicode Representations of Strings" as describing a "bag of bytes". Perhaps you can be more specific and quote some text that gives you this impression.

I would argue that these methods decrease the safety of a String and that they do indeed change the contract of the API design. If an application opens a truly binary file (e.g., something that was encrypted, or an executable) and you initialize a String from its contents, I would argue that the String does not hold valid characters, and hence its value is not a string value.

String offers a robust toolbox for dealing with a "bag of bytes", but using it as such represents an abuse. I think NSString may have encouraged years of such abuse. Even more than a UInt8View for String, which would only perpetuate the abuse, I would like to determine the shortcomings of [UInt8], as this is the purest representation of a "bag of bytes".

Cheers,
-Patrick

···

On Feb 14, 2016, at 1:40 AM, Zach Waldowski via swift-evolution <swift-evolution@swift.org> wrote:

I think you're drawing an overly arbitrary distinction about the
semantics. I'd recommend a close re-reading of the Swift book's chapters
on String after their reworking in 2.0; it bridges together the "linked
list of text-ish things", "collection of Characters", and "bag of bytes"
ideas rather well. They're not mutually exclusive.

The new methods do not decrease the safety of String, nor do they
change the contract of the API design. It should not be possible to get
malformed strings back from the new API; the non-validating version
automatically performs repairs, and the validating version fails (by
returning nil) on any errors. In fact, exposing these APIs in a way that
is aware and respectful of String's underpinnings is safer than the
alternative. The stdlib won't screw up things like surrogate pairs or
range checking of valid code points, whereas I've seen plenty of code
try and do what these methods do themselves by upcasting UInt8 to
UnicodeScalar and accumulating.
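
For illustration, both behaviors, and the byte-upcasting anti-pattern, can be sketched as follows (a sketch using later stdlib spellings such as `String(decoding:as:)` and Foundation's `String(bytes:encoding:)`, not the proposal's own entry points):

```swift
import Foundation

let bad: [UInt8] = [0x48, 0x69, 0xFF]   // "Hi" plus a byte that is never valid UTF-8

// Non-validating: always yields a String, repairing errors with U+FFFD.
let repaired = String(decoding: bad, as: UTF8.self)
assert(repaired == "Hi\u{FFFD}")

// Validating: fails (returns nil) on any error.
let strict = String(bytes: bad, encoding: .utf8)
assert(strict == nil)

// The error-prone pattern: upcasting each byte to a scalar and accumulating.
// Multi-byte sequences are silently misread instead of repaired or rejected.
let cafe = Array("café".utf8)           // 5 bytes; "é" encodes as 0xC3 0xA9
let mangled = String(cafe.map { Character(Unicode.Scalar($0)) })
assert(mangled == "cafÃ©")              // the two bytes of "é" became "Ã" and "©"
```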

Addressing other points about the proposal: overall, I agree with you
that the Views would do a better job of this over the long term, but C
and ObjC interop simply require entry points like the ones in this
proposal, which are in line with how Swift works today. This proposal is
not intended to overhaul String, even though that may one day be
desirable, judging by what Dave and others said on the Evolution thread.

Thanks for your feedback! :)

Zach Waldowski
zach@waldowski.me

On Sat, Feb 13, 2016, at 05:33 PM, Patrick Gili via swift-evolution wrote:

Okay. However, does this change the implied semantics?

On Feb 13, 2016, at 5:26 PM, Brent Royal-Gordon <brent@architechies.com> wrote:

The introduction starts out by making the claim, "Going back and forth from Strings to their byte representations is an important part of solving many problems, including object serialization, binary and text file formats, wire/network interfaces, and cryptography." Essentially, these problems deal with an array of raw bytes, and I have to wonder why an application would push them into a String?

I read this section as trying to say "object serialization, binary and text file formats, wire/network interfaces, and cryptography all require you to construct strings from decoded bytes, which is what this proposal is trying to improve". I don't think it's trying to say that we should have better support for treating strings as bags of arbitrary bytes, and in fact I don't think this proposal does that.

--
Brent Royal-Gordon
Architechies

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


Hi Patrick,

I think the “bag of bytes” characterization might result from a shortcoming in the wording of the proposal, but it’s not actually a concern for the proposed new methods themselves. The intended use for these methods is to convert ASCII, UTF-8, or UTF-16 code unit sequences into Strings. That’s about as fundamental to the functioning of a Unicode-compliant String type as you can get.

It’s true that you could use these to convert some arbitrary sequence of bytes into a String. If there are invalid characters that result from that, the failable initializer will fail and the standard initializer will silently “repair” the characters. The decode() method will tell you whether it repaired anything. Those are exactly the set of options I would want.
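
That "did it repair anything?" signal can be sketched with the stdlib's `transcode` function, which reports whether any ill-formed input was encountered (spelled here in later stdlib form, not the proposal's):

```swift
var out: [UInt16] = []
let bad: [UInt8] = [0x48, 0xFF]          // "H" plus an invalid byte
let hadError = transcode(bad.makeIterator(), from: UTF8.self, to: UTF16.self,
                         stoppingOnError: false) { out.append($0) }
assert(hadError)                         // true: at least one repair happened
assert(out == [0x48, 0xFFFD])            // the invalid byte became U+FFFD
```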

—CK

···

On Feb 16, 2016, at 5:15 AM, Patrick Gili via swift-evolution <swift-evolution@swift.org> wrote:

Hi Zach,

On your advice, I went back and read the sections of the Swift book relating to Strings and Characters. While it is easy to see that a String is a "linked list of text-ish things" or a "collection of Characters", I do not see anything that encourages a developer to treat a String as a "bag of bytes". I would not interpret the section on "Unicode Representations of Strings" as describing a "bag of bytes". Perhaps you can be more specific and quote some text that gives you this impression.

I would argue that these methods decrease the safety of a String and that they do indeed change the contract of the API design. If an application opens a truly binary file (e.g., something that was encrypted, or an executable) and you initialize a String from its contents, I would argue that the String does not hold valid characters, and hence its value is not a string value.

String offers a robust toolbox for dealing with a "bag of bytes", but using it as such represents an abuse. I think NSString may have encouraged years of such abuse. Even more than a UInt8View for String, which would only perpetuate the abuse, I would like to determine the shortcomings of [UInt8], as this is the purest representation of a "bag of bytes".

Cheers,
-Patrick


How?
When the input is not valid Unicode, these initializers either repair inconsistencies before returning a valid (perhaps gibberish) String, or fail and return nil. (In principle; bugs notwithstanding.)

Note that the exact same code can be invoked right now by copying bytes to a buffer, appending a zero and calling String.fromCString(). Should that function be eliminated?
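
That round trip can be sketched with today's C-string entry point (a later spelling; the old `fromCString` was the validating flavor, while `String(cString:)` shown here is the repairing one):

```swift
let bytes: [UInt8] = [0x48, 0x69, 0xFF]  // "Hi" plus an invalid byte
// Copy the bytes to a buffer, append a zero, and decode as a C string.
let s = (bytes + [0]).withUnsafeBufferPointer {
    String(cString: $0.baseAddress!)
}
assert(s == "Hi\u{FFFD}")                // repaired, same as the proposed initializers
```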

Cheers,
Guillaume Lessard

···

On 16 févr. 2016, at 06:15, Patrick Gili via swift-evolution <swift-evolution@swift.org> wrote:

I would argue that these methods decrease the safety of a String and that they do indeed change the contract of the API design. If an application opens a truly binary file (e.g., something that was encrypted, or an executable) and you initialize a String from its contents, I would argue that the String does not hold valid characters, and hence its value is not a string value.

Hi Charles,

If the intent is to initialize a String with a "bag of characters", then I'm fine. It sounds like it would be difficult to initialize a String from an arbitrary sequence of bytes without the possibility of mutating the "bag of bytes" imported into the instance.

Cheers,
-Patrick

···

On Feb 16, 2016, at 2:08 PM, Charles Kissinger <crk@akkyra.com> wrote:

Hi Patrick,

I think the “bag of bytes” characterization might result from a shortcoming in the wording of the proposal, but it’s not actually a concern for the proposed new methods themselves. The intended use for these methods is to convert ASCII, UTF-8, or UTF-16 code unit sequences into Strings. That’s about as fundamental to the functioning of a Unicode-compliant String type as you can get.

It’s true that you could use these to convert some arbitrary sequence of bytes into a String. If there are invalid characters that result from that, the failable initializer will fail and the standard initializer will silently “repair” the characters. The decode() method will tell you whether it repaired anything. Those are exactly the set of options I would want.

—CK


Hi Guillaume,

Sorry, my mail client presents messages in reverse chronological order. Charles replied with almost the same response as yours.

The proposal is somewhat confusing in this regard; it is not entirely clear that the intent is not to abuse String. However, you and Charles have explained it to me, and I'm fine with it.

Dave Abrahams echoed a concern of mine, though. The section discussing alternatives presents an approach that might be a better solution to the problem. We should discuss this.

Cheers,
-Patrick

···

On Feb 16, 2016, at 1:36 PM, Guillaume Lessard <glessard@tffenterprises.com> wrote:

On 16 févr. 2016, at 06:15, Patrick Gili via swift-evolution <swift-evolution@swift.org> wrote:

I would argue that these methods decrease the safety of a String and that they do indeed change the contract of the API design. If an application opens a truly binary file (e.g., something that was encrypted, or an executable) and you initialize a String from its contents, I would argue that the String does not hold valid characters, and hence its value is not a string value.

How?
When the input is not valid Unicode, these initializers either repair inconsistencies before returning a valid (perhaps gibberish) String, or fail and return nil. (In principle; bugs notwithstanding.)

Note that the exact same code can be invoked right now by copying bytes to a buffer, appending a zero and calling String.fromCString(). Should that function be eliminated?

Cheers,
Guillaume Lessard

Hi Guillaume,

Sorry, my mail client presents messages in reverse chronological order. Charles replied with almost the same response as yours.

The proposal is somewhat confusing in this regard; it is not entirely clear that the intent is not to abuse String. However, you and Charles have explained it to me, and I'm fine with it.

Dave Abrahams echoed a concern of mine, though. The section discussing alternatives presents an approach that might be a better solution to the problem. We should discuss this.

Personally, I’m not a fan of moving some cases of String initialization (or appending) into nested types (mutable versions of UTF8View and UTF16View in this case). I think it would make the interface more complex for users. Maybe I'm just not seeing the advantages of this approach, though. The proposal suggests it might be better for API maintenance, so maybe Zach can elaborate on that.

—CK

···

On Feb 16, 2016, at 12:31 PM, Patrick Gili via swift-evolution <swift-evolution@swift.org> wrote:

Cheers,
-Patrick


Agreed.

One alternative is to have initializers on UTF{8,16}View:

String.UTF16View.init<Input: CollectionType where Input.Generator.Element == UTF16.CodeUnit>(input: Input)
String.UTF8View.init<Input: CollectionType where Input.Generator.Element == UTF8.CodeUnit>(input: Input)

There would probably be repairing and failable versions. However, the String initializers from UTF8View and UTF16View are failable themselves:

String.init?(_ utf8: String.UTF8View)
String.init?(_ utf16: String.UTF16View)

Either we allow building incorrect instances of UTF{8,16}View, or their slices must be correct at all times and the String initializers can be made non-failable. Or we have to check for nil twice in the process. I don’t see how this would be more elegant.
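
An "incorrect instance" here would be, for example, a UTF16View holding an unpaired surrogate; the repairing decode path turns that into U+FFFD rather than letting an invalid view escape (a sketch in later stdlib spelling, since the proposed view initializers do not exist):

```swift
let units: [UInt16] = [0x0041, 0xD800]   // "A" plus an unpaired high surrogate
let s = String(decoding: units, as: UTF16.self)
assert(s == "A\u{FFFD}")                 // repaired rather than left ill-formed
```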

The alternative specifically mentioned in 0027 is mutable forms of UTF{8,16}View; I’m not sure what the idea is here. Does applying a mutation to String.utf8 mutate the parent String?

In the current proposal (and the current state of String), data goes into a String via one of its initializers.
Data comes out of a String via one of the view types. It's fairly straightforward.

Guillaume Lessard

···

On 16 févr. 2016, at 13:31, Patrick Gili <gili.patrick.r@gili-labs.com> wrote:

Dave Abrahams echoed a concern of mine, though. The section discussing alternatives presents an approach that might be a better solution to the problem. We should discuss this.

"In the current proposal (and the current state of String), data goes into
a String via one of its initializers.
Data comes out of a String via one of the view types. It's fairly
straightforward."

I agree. This proposal seems in good alignment with the current String
model, which I think is actually a good model. I believe this proposal
exposes a reasonable API to provide – what I consider to be – a needed
capability in the base String API.

In summary, adding String initializers that accept code unit collections
is useful, and they can be designed such that you always get a valid
String (e.g., with potential corrective action) or no String (failable
initializer).

I see no good reason to meddle with the view aspects of the String
system (e.g., the UTFxxView types).

-Shawn

···

On Tue, Feb 16, 2016 at 3:24 PM Guillaume Lessard via swift-evolution <swift-evolution@swift.org> wrote:
