[Revised and review extended] SE-0180 - String Index Overhaul

Proposal Link: swift-evolution/0180-string-index-overhaul.md at master · apple/swift-evolution · GitHub

The review of “SE-0180 - String Index Overhaul” ran from June 4…8, 2017.

Feedback on the proposal and further discussion in the Core Team has resulted in a slightly revised proposal. A revised review of that proposal will run from now until June 28.

The revised proposal focuses on feedback concerning cases where indices fall between Unicode scalar boundaries in views having distinct encodings. Please see the revised review (link above) for the specific technical revisions.

Thanks to everyone who participated in providing feedback for the review, and please take a look at the revised proposal!

  - Ted
  Review Manager

— REVISED REVIEW —

When replying, please try to keep the proposal link at the top of the message:

Proposal link:

swift-evolution/0180-string-index-overhaul.md at master · apple/swift-evolution · GitHub
Reply text

Other replies

What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer in your review:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?
More information about the Swift evolution process is available at:

https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md

Given the discussion in the original thread about potentially having
Strings backed by something other than utf16 code units, I'm somewhat
concerned about having this kind of vague `encodedOffset` that happens
to be UTF16 code units. If this is supposed to represent an offset into
whatever code units the String is backed by, then it's going to be a
problem because the user isn't supposed to know or care what the
underlying storage for the String is. And I can imagine potential issues
with archiving/unarchiving where the unarchived String has a different
storage type than the archived one, and therefore `encodedOffset` would
gain a new meaning that screws up unarchived String.Index values.
The other problem with using this as utf16 is how am I supposed to
archive/unarchive a String.Index that comes from String.UTF8View? AFAICT
the only way to do that is to ignore encodedOffset entirely and instead
calculate the distance between s.utf8.startIndex and my index (and then
recreate the index later on by advancing from startIndex). But this RFC
explicitly says that archiving/unarchiving indices is one of the goals
of this overhaul.

···

--

The section on comparison still talks about how this is a weak ordering.
In the other thread it was explained as being done so because the
internal transcodedOffset isn't public, but that still makes this
explanation very odd. String.Index comparison should not be weak
ordering, because all indices can be expressed in the utf8View if
nothing else, and in that view they have a total order. So it should
just be defined as a total order, based on the position in the utf8View
that the index corresponds with.
--

The detailed design of the index has encodedOffset being mutable (and
this was confirmed in the other thread as intentional). I don't think
this is a good idea, because it makes the following code behave oddly:
  let x = index.encodedOffset
  index.encodedOffset = x

Specifically, this resets the private transcodedOffset, so if you do
this with an intra-code-unit Index taken from the utf8View, the modified
Index may point to a different byte.
I'm also not sure why you'd ever want to do this operation anyway. If
you want to change the encodedOffset, you can just say `index =
String.Index(encodedOffset: x)`.
-Kevin Ballard

Proposal Link: https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md

The review of “SE-0180 - String Index Overhaul” ran from June 4…8, 2017.

Feedback on the proposal and further discussion in the Core Team has resulted in a slightly revised proposal. A revised review of that proposal will run from now until June 28.

The revised proposal focuses on feedback concerning cases where indices fall between Unicode scalar boundaries in views having distinct encodings. Please see the revised review (link above) for the specific technical revisions.

Thanks to everyone who participated in providing feedback for the review, and please take a look at the revised proposal!

  - Ted
  Review Manager

— REVISED REVIEW —

When replying, please try to keep the proposal link at the top of the message:

Proposal link:

https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
Reply text

Other replies

What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer in your review:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?
More information about the Swift evolution process is available at:

https://github.com/apple/swift-evolution/blob/master/process.md_______________________________________________
swift-evolution-announce mailing list
swift-evolution-announce@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution-announce

···

On Jun 22, 2017, at 17:10, Ted Kremenek <kremenek@apple.com> wrote:

Accidentally hit send! Sorry everyone.

Here's what I was originally looking for: a diff between the old and new revisions of the proposal: Update SE-0180 per core team discussion · apple/swift-evolution@0d163c9 · GitHub

Jordan

···

On Jun 26, 2017, at 10:47, Jordan Rose <jordan_rose@apple.com> wrote:

On Jun 22, 2017, at 17:10, Ted Kremenek <kremenek@apple.com <mailto:kremenek@apple.com>> wrote:

Proposal Link: https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md

The review of “SE-0180 - String Index Overhaul” ran from June 4…8, 2017.

Feedback on the proposal and further discussion in the Core Team has resulted in a slightly revised proposal. A revised review of that proposal will run from now until June 28.

The revised proposal focuses on feedback concerning cases where indices fall between Unicode scalar boundaries in views having distinct encodings. Please see the revised review (link above) for the specific technical revisions.

Thanks to everyone who participated in providing feedback for the review, and please take a look at the revised proposal!

  - Ted
  Review Manager

— REVISED REVIEW —

When replying, please try to keep the proposal link at the top of the message:

Proposal link:

https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
Reply text

Other replies

What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer in your review:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?
More information about the Swift evolution process is available at:

https://github.com/apple/swift-evolution/blob/master/process.md_______________________________________________
swift-evolution-announce mailing list
swift-evolution-announce@swift.org <mailto:swift-evolution-announce@swift.org>
https://lists.swift.org/mailman/listinfo/swift-evolution-announce

Is that true? The String manifesto shows a design where the underlying Encoding and code-units are exposed.

From the talk about String’s being backed by something that isn’t UTF-16, I took that to mean that String might one-day become generic. Defaults for generic parameters have been mentioned on the list before, so “String” could still refer to “String<UTF16Encoding>” on OSX and maybe “String<UTF8Encoding>” on Linux.

I would support a definition of encodedOffset that removed mention of UTF-16 and phrased things in terms of String.Encoding and code-units. For example, I would like to be able to construct new String indices from a known index plus a quantity of code-units known to represent a sequence of characters:

var stringOne = “Hello,“
let stringTwo = “ world"

var idx = stringOne.endIndex
stringOne.append(contentsOf: stringTwo)
idx = String.Index(encodedOffset: idx.encodedOffset + stringTwo.codeUnits.count)
assert(idx == stringOne.endIndex)

- Karl

···

On 23. Jun 2017, at 02:59, Kevin Ballard via swift-evolution <swift-evolution@swift.org> wrote:

https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md

Given the discussion in the original thread about potentially having Strings backed by something other than utf16 code units, I'm somewhat concerned about having this kind of vague `encodedOffset` that happens to be UTF16 code units. If this is supposed to represent an offset into whatever code units the String is backed by, then it's going to be a problem because the user isn't supposed to know or care what the underlying storage for the String is.

https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md

Given the discussion in the original thread about potentially having
Strings backed by something other than utf16 code units, I'm somewhat
concerned about having this kind of vague `encodedOffset` that happens
to be UTF16 code units.

What's vague about it?

If this is supposed to represent an offset into whatever code units
the String is backed by, then it's going to be a problem because the
user isn't supposed to know or care what the underlying storage for
the String is.

In the long run a user that cares about performance may very well care
about the underlying encoding. However, that, like access to the
index's encodedOffset, is an expert-level concern.

And I can imagine potential issues with archiving/unarchiving where
the unarchived String has a different storage type than the archived
one, and therefore `encodedOffset` would gain a new meaning that
screws up unarchived String.Index values.

Yes. If you round-trip serialize the index and you don't also preserve
the encoding of the string, you will get nonsense. As far as I know
that is a feature of any universe in which multiple arbitrary encodings
exist... like the universe we live in.

The other problem with using this as utf16 is how am I supposed to
archive/unarchive a String.Index that comes from String.UTF8View?

You can serialize its encodedOffset, which will work as long as the
index is on a Unicode scalar boundary. If it is not, you can compute
the notional transcodedOffset and serialize that too, by measuring the
distance in the UTF8 view to your index from the previous unicode scalar
boundary. Then you can reconstruct that position. Obviously this is
inconvenient. The proposal is not trying to solve the problem of doing
that conveniently today; it's only trying to lay the necessary
groundwork.

AFAICT the only way to do that is to ignore encodedOffset entirely and
instead calculate the distance between s.utf8.startIndex and my index
(and then recreate the index later on by advancing from
startIndex).

But this RFC explicitly says that archiving/unarchiving indices is one
of the goals of this overhaul. --

The section on comparison still talks about how this is a weak
ordering. In the other thread it was explained as being done so
because the internal transcodedOffset isn't public, but that still
makes this explanation very odd. String.Index comparison should not be
weak ordering, because all indices can be expressed in the utf8View if
nothing else, and in that view they have a total order. So it should
just be defined as a total order, based on the position in the
utf8View that the index corresponds with. --

An index between the halves of a UTF16 surrogate pair has no
corresponding position in the UTF8 view. You could arbitrarily choose
one (e.g. two UTF8 code units past the start of the unicode scalar), but
I'm not sure that would produce better results.

The detailed design of the index has encodedOffset being mutable (and
this was confirmed in the other thread as intentional). I don't think
this is a good idea, because it makes the following code behave oddly:
  let x = index.encodedOffset
  index.encodedOffset = x

Specifically, this resets the private transcodedOffset, so if you do
this with an intra-code-unit Index taken from the utf8View, the
modified Index may point to a different byte. I'm also not sure why
you'd ever want to do this operation anyway. If you want to change the
encodedOffset, you can just say `index = String.Index(encodedOffset:
x)`.

I can take or leave the mutability of encodedOffset, personally.

···

on Thu Jun 22 2017, Kevin Ballard <swift-evolution@swift.org> wrote:

--
-Dave

I would support a definition of encodedOffset that removed mention of UTF-16 and phrased things in terms of String.Encoding and code-units. For example, I would like to be able to construct new String indices from a known index plus a quantity of code-units known to represent a sequence of characters:

var stringOne = “Hello,“
let stringTwo = “ world"

var idx = stringOne.endIndex
stringOne.append(contentsOf: stringTwo)
idx = String.Index(encodedOffset: idx.encodedOffset + stringTwo.codeUnits.count)
assert(idx == stringOne.endIndex)

I second this concern. We currently use a non-Foundation library that prefers UTF8 encoding, I think UTF8-backed strings are important.

The choice of UTF16 as string storage in Swift makes historical sense (e.g. runtime interop with ObjC-backed strings) but as Swift moves forward it makes less sense. We need a string system that behaves more like a lightweight accessor for the underlying storage (e.g. if you like your input's encoding you can keep it) unless you do something (like peruse a view) that requires promotion to a new format. That's a different proposal, but that's the direction I'd like to see us head.

This proposal is in many ways the opposite of that, it specifies that we standardize on UTF16, and in particular we have in view the problem of file archiving (where we would have long-term unarchival guarantees) that complicate backing this out later. This feels like a kludge to support Foundation. In the archive context the offset should either be "whatever the string is" (which you would have to know anyway to archive/unarchive that string) or a full-fledged offset type that specifies the encoding such as

let i = String.Index (
encoding: .utf16
offset: 36
)

the latter of which would be used to port an Index between string representations if that's a useful feature.

More broadly though, I disagree with the motivation of the proposal, specifically

The result is a great deal of API surface area for apparently little gain in ordinary code

In ordinary code, we work with a single string representation (e.g. in Cocoa it's UTF16), and there is a correspondence between our UTF16 offset and our UTF16 string such that index lookups will succeed. When we collapse indexes, we lose the information to make this correspondence, which were previously encoded into the typesystem. So the "gain in ordinary code" is that programmers do not have to sprinkle `!` in the common case of string index lookups because we can infer at compile time from the type correspondence it is unnecessary.

Under this proposal, they will have to sprinkle the `!`, which adds friction and performance impact (`!` is a runtime check, and UTF16 promotions are expensive). I don't believe the simplicity of implementing archival (which one has to only write once) is worth the hassle of complicating all string index lookups.

Does this proposal fit well with the feel and direction of Swift?

To me, one of Swift's greatest strengths is the type system. We can encode information into the type system and find our bugs at compile time instead of runtime.

Here, we are proposing to partially erase a type because it's annoying to write code that deals with string encodings. But our code will deal with string encodings somehow whether `utf16` appears in our sourcecode or not.

When we erase the type of our offset, we lose a powerful tool to prove the correctness of our string encodings, that is, the compiler can check our utf16 offset is used with a utf16 string. Without that tool, we either have to check that dynamically, or, worst case, there are bugs in our program.

Under this proposal we would encourage the use of a bare-integer offsets for string lookup. That does not seem Swifty to me. A Swifty solution would be to add a dynamically-checked type-erased String.Index alongside the existing statically-checked fully-typed String.UTF8/16View.Index so that the programmer can choose the abstraction with the performance/simplicity behavior appropriate for their problem.

···

On June 26, 2017 at 5:43:42 PM, Karl Wagner via swift-evolution (swift-evolution@swift.org) wrote:

Hi Karl,

It was pointed out to me that I never answered this thoughtful post of
yours...

https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
<https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md&gt;

Given the discussion in the original thread about potentially having
Strings backed by something other than utf16 code units, I'm
somewhat concerned about having this kind of vague `encodedOffset`
that happens to be UTF16 code units. If this is supposed to
represent an offset into whatever code units the String is backed
by, then it's going to be a problem because the user isn't supposed
to know or care what the underlying storage for the String is.

Is that true? The String manifesto shows a design where the underlying
Encoding and code-units are exposed.

That is the eventual goal. Note that with this proposal we are making
progress towards that goal, but getting all the way there is out of
scope for this release.

From the talk about String’s being backed by something that isn’t
UTF-16, I took that to mean that String might one-day become
generic. Defaults for generic parameters have been mentioned on the
list before, so “String” could still refer to “String<UTF16Encoding>”
on OSX and maybe “String<UTF8Encoding>” on Linux.

I think you may have misunderstood. String currently supports a few
different underlying representations (ASCII, UTF-16, NSString), all of
which happen to use a UTF-16-compatible encoding. The eventual goal is
to expand the possible underlying representations of String to
accomodate other encodings.

That said, the underlying representation of String is *not* part of
String's type, and we don't intend to change that. When String APIs
access the underlying representation, that access is dynamically
dispatched. If the encoding were a generic parameter, then it would be
statically dispatched (at least in part), but it would also become part
of String's type, and, for example, you would get an error when trying
to pass a String<Unicode.ASCII> where a String<Unicode.UTF16> was
expected. It's important that code passing Strings around remain
smoothly interoperable, so we don't want to introduce this sort of type
mismatch.

Instead, the intention is that someone could make a UTF8String type that
conformed to StringProtocol, and that String itself could be constructed
from any instance of StringProtocol to be used as its underlying
representation. That way, if you need the performance that comes with
knowing and manipulating the underlying encoding, you can use UTF8String
directly, and if you need to interoperate with code that uses the
lingua-franca String type, you can wrap String around your UTF8String
and pass that.

I would support a definition of encodedOffset that removed mention of
UTF-16 and phrased things in terms of String.Encoding and
code-units.

Well, a few points about this:

I support removing the text “(UTF-16)” from the initial documentation
comments on these APIs, which is, AFAICT, the only source of the concern
you and others have expressed. That said, Strings are in fact currently
encoded as UTF-16 and as long as Cocoa interop is important, that too is
important and useful information, so it should be documented somewhere.

I don't support describing anything in terms of String.Encoding at this
time. That enum was added to String by the Foundation overlay, and is not
part of the plan for String except insofar as it is required for source
compatibility and Cocoa interop. A more appropriate way to describe the
encoding in terms of the language would be as something like
Unicode.UTF16 (at compile-time) or an instance of Unicode.Encoding.Type
(at runtime). But I see no need to describe it in language terms until
we are ready to add APIs to String that can support multiple encodings
and/or report the underlying encoding, and we are not ready to do that
yet.

For example, I would like to be able to construct new String indices
from a known index plus a quantity of code-units known to represent a
sequence of characters:

var stringOne = “Hello,“
let stringTwo = “ world"

var idx = stringOne.endIndex
stringOne.append(contentsOf: stringTwo)
idx = String.Index(encodedOffset: idx.encodedOffset + stringTwo.codeUnits.count)
assert(idx == stringOne.endIndex)

I'm not sure what you mean by “represent a sequence of characters” in
this context. Don't a sequence of code units always represent a
sequence of characters?

The code you wrote above would (almost) work as written under this
proposal, given that Strings always have an encoding that's compatible
with some default. In other words, making it work *depends* on the fact
that the encoding of stringTwo is compatible with (has a non-strict
sub/superset relation with) that of stringOne. If stringOne were
encoded as today but stringTwo were encoded with some other encoding,
say, Shift-JIS, the code might not work. So making code like this work
depends on the very information that you have expressed a concern abut
seeing in the documentation.

The reason I wrote “(almost)” above is that we are not yet proposing to
expose a “codeUnits” view on String, and shouldn't do so until we are
ready to introduce the more flexible encoding options discussed earlier.
Today, you'd use the utf16 view to get that information. We are headed
down the road in your vision, but we can't arrive there in this release.

Hope this helps,

···

on Mon Jun 26 2017, Karl Wagner <swift-evolution@swift.org> wrote:

On 23. Jun 2017, at 02:59, Kevin Ballard via swift-evolution >> <swift-evolution@swift.org> wrote:

--
-Dave

Where did anyone get that idea? That is not at all the intention. The
intention of mentioning UTF-16 in the proposal is merely to
*acknowledge* that today, all strings have a UTF-16-compatible encoding,
to help people understand what encodedOffset will mean in practice with
today's string implementation.

···

on Tue Jun 27 2017, Drew Crawford <swift-evolution@swift.org> wrote:

On June 26, 2017 at 5:43:42 PM, Karl Wagner via swift-evolution > (swift-evolution@swift.org) wrote:

I would support a definition of encodedOffset that removed mention of
UTF-16 and phrased things in terms of String.Encoding and
code-units. For example, I would like to be able to construct new
String indices from a known index plus a quantity of code-units known
to represent a sequence of characters:

var stringOne = “Hello,“
let stringTwo = “ world"

var idx = stringOne.endIndex
stringOne.append(contentsOf: stringTwo)
idx = String.Index(encodedOffset: idx.encodedOffset + stringTwo.codeUnits.count)
assert(idx == stringOne.endIndex)

I second this concern. We currently use a non-Foundation library that prefers UTF8 encoding, I
think UTF8-backed strings are important.

The choice of UTF16 as string storage in Swift makes historical sense
(e.g. runtime interop with ObjC-backed strings) but as Swift moves
forward it makes less sense. We need a string system that behaves
more like a lightweight accessor for the underlying storage (e.g. if
you like your input's encoding you can keep it) unless you do
something (like peruse a view) that requires promotion to a new
format. That's a different proposal, but that's the direction I'd
like to see us head.

This proposal is in many ways the opposite of that, it specifies that
we standardize on UTF16

--
-Dave