SE-0180: String Index Overhaul


(Ted Kremenek) #1

Hello Swift community,

The review of SE-0180 "String Index Overhaul" begins now and runs through June 8, 2017.

The proposal is available here:

https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md

Reviews are an important part of the Swift evolution process. All reviews should be sent to the swift-evolution mailing list at:

https://lists.swift.org/mailman/listinfo/swift-evolution
or, if you would like to keep your feedback private, directly to the review manager. When replying, please try to keep the proposal link at the top of the message:

Proposal link:

https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md

Reply text

Other replies
What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer in your review:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?
More information about the Swift evolution process is available at:

https://github.com/apple/swift-evolution/blob/master/process.md

Thank you,
Ted (Review Manager)


(Lily Ballard) #2

https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md

Overall it looks pretty good. But unfortunately the answer to "Will applications still compile but produce different behavior than they used to?" is actually "Yes", when using APIs provided by Foundation. This is because Foundation is currently able to return String.Index values that don't point to Character boundaries.

Specifically, in Swift 3, the following code:

import Foundation

let str = "e\u{301}galite\u{301}"
let r = str.rangeOfCharacter(from: ["\u{301}"])!
print(str[r] == "\u{301}")

will print “true”, because the returned range identifies the combining acute accent only. But with the proposed String.Index revisions, the `str[r]` subscript will return the whole "e\u{301}" combined character.

This is, of course, an edge case, but we need to consider the implications of this and determine if it actually affects anything that’s likely to be a problem in practice.
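The distinction at issue can be reproduced without Foundation. What follows is only a minimal sketch, assuming a current Swift 5 toolchain rather than the Swift 3 or proposed Swift 4 behavior under review. It walks the unicodeScalars view to build a scalar-level range covering just the accent, which is what the range in the example above is described as identifying. The names `lower`, `upper`, and `sliced` are illustrative.

```swift
let str = "e\u{301}galite\u{301}"
let scalars = str.unicodeScalars

// Bounds of the first combining acute accent (U+0301), found by walking scalars.
let lower = scalars.index(after: scalars.startIndex)
let upper = scalars.index(after: lower)

// Sliced through the scalar view, the range covers the accent alone.
let sliced = scalars[lower..<upper].map { $0.value }
print(sliced == [0x301])                    // true

// The same lower bound is not a Character boundary in `str`.
print(lower.samePosition(in: str) == nil)   // true
print(str.count, scalars.count)             // 7 9
```

The open question in the review is what `str[r]` should do when handed bounds like these, which are valid scalar positions but fall inside a Character.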

There’s also the curious case where I can have two String.Index values that compare unequal but actually return the same value when used in a subscript. For example, with the above string, String.Index(encodedOffset: 0) and String.Index(encodedOffset: 1) would be two such values. This may not be a problem in practice, but it’s something to be aware of.

I’m also confused by the paragraph about index comparison. It says that if two indices are valid in a single String view, comparison follows Collection semantics; otherwise indices are compared by their encodedOffsets, which means indices aren’t totally ordered. But I’m not sure what the first part is supposed to mean. How is comparing indices that are valid within a single view any different from comparing their encodedOffsets?

-Kevin Ballard


(Hooman Mehr) #3

Overall, I am strong +1 on this, but I don’t have time to go through a detailed analysis of how it will affect my own use cases.

···

On Jun 4, 2017, at 4:29 PM, Ted Kremenek via swift-evolution <swift-evolution@swift.org> wrote:

Hello Swift community,

The review of SE-0180 "String Index Overhaul" begins now and runs through June 8, 2017.

The proposal is available here:

https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
Reviews are an important part of the Swift evolution process. All reviews should be sent to the swift-evolution mailing list at:

https://lists.swift.org/mailman/listinfo/swift-evolution
or, if you would like to keep your feedback private, directly to the review manager. When replying, please try to keep the proposal link at the top of the message:

Proposal link:

https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
Reply text

Other replies
What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer in your review:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?
More information about the Swift evolution process is available at:

https://github.com/apple/swift-evolution/blob/master/process.md
Thank you,
Ted (Review Manager)

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Dave Abrahams) #4

https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md

Overall it looks pretty good. But unfortunately the answer to "Will
applications still compile but produce different behavior than they
used to?" is actually "Yes", when using APIs provided by
Foundation. This is because Foundation is currently able to return
String.Index values that don't point to Character boundaries.

Specifically, in Swift 3, the following code:

import Foundation

let str = "e\u{301}galite\u{301}"
let r = str.rangeOfCharacter(from: ["\u{301}"])!
print(str[r] == "\u{301}")

will print “true”, because the returned range identifies the combining
acute accent only. But with the proposed String.Index revisions, the
`str[r]` subscript will return the whole "e\u{301}" combined
character.

Hmm, true.

This doesn't totally invalidate the concern, but...

The existing behavior is a bug in the way Foundation interfaces with the
3.0 standard library. str.rangeOfCharacter (which should be
str.rangeOfUnicodeScalar) should be returning
Range<String.UnicodeScalarView.Index> but is returning a misaligned
Range<String.Index>. Everything in the 3.0 standard library design is
engineered to ensure that misaligned String indices don't happen at all
(although they still can—just use an index from string1 in string2),
thus the rigorous failable index conversion APIs.

It's easy to produce results with this API that don't make sense in
Swift 3:

  let str = "e\u{301}\u{302}galite\u{301}"
  let r = str.rangeOfCharacter(from: ["\u{301}"])!
  print(str[r.lowerBound] == "\u{301}") // false

This is, of course, an edge case, but we need to consider the
implications of this and determine if it actually affects anything
that’s likely to be a problem in practice.

I agree. It would also be reasonable to pick a different behavior for
misaligned indices, for example:

  Indices *that don't fall on a code unit boundary* are “rounded down”
  before use.

The existing behaviors for these cases are a cluster of coincidences,
and were never designed. I doubt that preserving them in their current
form makes sense and will lead to a usable string semantics for the long
term, but if they do in fact happen to make sense, we'd still need to
codify the rules so we can keep future behaviors consistent.

There’s also the curious case where I can have two String.Index values
that compare unequal but actually return the same value when used in a
subscript.
For example, with the above string, if I have a
String.Index(encodedOffset: 0) and a String.Index(encodedOffset:
1). This may not be a problem in practice, but it’s something to be
aware of.

I don't think this one even rises to that level.

let s = "aaa"
var si = s.indices.makeIterator()
let i0 = si.next()!
let i1 = si.next()!
print(i0 == i1) // false
print(s[i0] == s[i1]) // true. Surprised?

I’m also confused by the paragraph about index comparison. It talks
about if two indices are valid in a single String view, comparison
semantics are according to Collection, and otherwise indexes are
compared using encodedOffsets, and this means indexes aren’t totally
ordered. But I’m not sure what the first part is supposed to mean. How
is comparing indices that are valid within a single view any different
than comparing the encodedOffsets?

In today's String, encodedOffset is an offset in UTF-16. Two indices
into a UTF-8 view may be unequal yet have the same encodedOffset.
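The same effect can be observed from the public API. This is a hedged sketch, assuming a current Swift 5 toolchain (where the storage encoding is not the UTF-16 described above, so encodedOffset values themselves may differ); the names `first` and `second` are illustrative. It shows two unequal UTF-8 view indices that lie within a single Unicode scalar, only one of which has a counterpart in the unicodeScalars view.

```swift
let s = "e\u{301}"            // "e" followed by U+0301
let utf8 = s.utf8             // code units: 0x65, 0xCC, 0x81

let first = utf8.index(after: utf8.startIndex)  // first code unit of U+0301
let second = utf8.index(after: first)           // second code unit of U+0301

// The two indices are distinct and ordered within the UTF-8 view.
print(first == second)   // false
print(first < second)    // true

// Only the first one corresponds to a Unicode scalar position.
print(first.samePosition(in: s.unicodeScalars) != nil)   // true
print(second.samePosition(in: s.unicodeScalars) == nil)  // true
```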

Regards,

···

on Mon Jun 05 2017, Kevin Ballard <swift-evolution@swift.org> wrote:

--
-Dave


(TJ Usiyan) #5

+1

I only gave it a quick read though.

···

On Sun, Jun 11, 2017 at 3:01 PM, Hooman Mehr via swift-evolution <swift-evolution@swift.org> wrote:

Overall, I am strong +1 on this, but I don’t have time to go through a
detailed analysis of how it will affect my own use cases.



(Lily Ballard) #6

> There’s also the curious case where I can have two String.Index values
> that compare unequal but actually return the same value when used in a
> subscript.
> For example, with the above string, if I have a
> String.Index(encodedOffset: 0) and a String.Index(encodedOffset:
> 1). This may not be a problem in practice, but it’s something to be
> aware of.

I don't think this one even rises to that level.

let s = "aaa"
var si = s.indices.makeIterator()
let i0 = si.next()!
let i1 = si.next()!
print(i0 == i1) // false
print(s[i0] == s[i1]) // true. Surprised?

Good point.

> I’m also confused by the paragraph about index comparison. It talks
> about if two indices are valid in a single String view, comparison
> semantics are according to Collection, and otherwise indexes are
> compared using encodedOffsets, and this means indexes aren’t totally
> ordered. But I’m not sure what the first part is supposed to mean. How
> is comparing indices that are valid within a single view any different
> than comparing the encodedOffsets?

In today's String, encodedOffset is an offset in UTF-16. Two indices
into a UTF-8 view may be unequal yet have the same encodedOffset.

Ah, right. So a String.Index is actually something similar to

public struct Index {
    public var encodedOffset: Int
    private var byteOffset: Int // UTF-8 offset into the UTF-8 code unit
}

In this case, can't we still define String.Index comparison as merely being the lexicographical comparison of (encodedOffset, byteOffset)?

Also, as a side note, the proposal implies that encodedOffset is mutable. Is this actually the case? If so, I assume that mutating it would also reset the byteOffset?

-Kevin Ballard

···

On Tue, Jun 6, 2017, at 10:57 AM, Dave Abrahams via swift-evolution wrote:

on Mon Jun 05 2017, Kevin Ballard <swift-evolution@swift.org> wrote:


(Dave Abrahams) #7

Having considered this further, I'd like to propose these revised semantics for
misaligned indices, to preserve the behavior of rangeOfCharacter and its
ilk:

* Definition: an index i is aligned with respect to a string view v iff

     v.indices.contains(i) || v.endIndex == i

  If i is not aligned with respect to v it is *misaligned* with respect
  to v.

* When i is misaligned with respect to a String/Substring view s.xxx
  (imagining s itself could also be spelled as s.xxx), combining s.xxx
  and i is done in terms of underlying code units and i.encodedOffset.

  It's very hard to write these semantics down precisely in terms of
  existing constructs, but this should give you a sense of what I have
  in mind:

  1. the suffix beginning at i is formed by slicing the underlying
    codeUnits at i.encodedOffset, forming a new Substring around that
    slice, and getting its corresponding xxx view

     s.xxx[i...]

  is roughly equivalent to:

    Substring(s.utf16[String.Index(encodedOffset: i.encodedOffset)...]).xxx

  (given that we currently have UTF-16 code units)

  2. similarly

     s.xxx[..<i]

  is equivalent to something like:

    Substring(s.utf16[..<String.Index(encodedOffset: i.encodedOffset)]).xxx

  3. s.xxx[i] is equivalent to s.xxx[i...].first!

  4. s.xxx.index(after: i) is equivalent to s.xxx[i...].indices.dropFirst().first!

  5. s.xxx.index(before: i) is equivalent to s.xxx[..<i].indices.last!

I'm concerned that we have no precise way to specify the semantics of #1
and #2, to the point where it might be better to implement them that way
but leave the semantics unspecified. Another alternative would be to
add the APIs needed to make it possible to express a precise equivalence
instead of a rough equivalence. If anyone has better ideas, I'm all ears.
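For illustration only, the alignment definition above can be written as a small generic helper. This is a sketch assuming a current Swift 5 toolchain; `isAligned` is a hypothetical name, not a proposed or existing API, and the linear search makes it unsuitable for anything but experimentation.

```swift
/// The definition above: `i` is aligned with respect to `v` iff
/// `v.indices.contains(i) || v.endIndex == i`. Linear time.
func isAligned<V: Collection>(_ i: V.Index, in v: V) -> Bool {
    return v.indices.contains(i) || v.endIndex == i
}

let s = "e\u{301}galite\u{301}"
let accent = s.unicodeScalars.index(after: s.unicodeScalars.startIndex)

// The accent's position is a scalar boundary but not a Character boundary.
print(isAligned(accent, in: s.unicodeScalars))  // true
print(isAligned(accent, in: s))                 // false
print(isAligned(s.endIndex, in: s))             // true
```

The revised semantics listed above apply to the second case, where the index is misaligned with respect to the view it is used with.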

···

on Tue Jun 06 2017, Dave Abrahams <swift-evolution@swift.org> wrote:

--
-Dave


(Dave Abrahams) #8

> There’s also the curious case where I can have two String.Index values
> that compare unequal but actually return the same value when used in a
> subscript.
> For example, with the above string, if I have a
> String.Index(encodedOffset: 0) and a String.Index(encodedOffset:
> 1). This may not be a problem in practice, but it’s something to be
> aware of.

I don't think this one even rises to that level.

let s = "aaa"
var si = s.indices.makeIterator()
let i0 = si.next()!
let i1 = si.next()!
print(i0 == i1) // false
print(s[i0] == s[i1]) // true. Surprised?

Good point.

> I’m also confused by the paragraph about index comparison. It talks
> about if two indices are valid in a single String view, comparison
> semantics are according to Collection, and otherwise indexes are
> compared using encodedOffsets, and this means indexes aren’t totally
> ordered. But I’m not sure what the first part is supposed to mean. How
> is comparing indices that are valid within a single view any different
> than comparing the encodedOffsets?

In today's String, encodedOffset is an offset in UTF-16. Two indices
into a UTF-8 view may be unequal yet have the same encodedOffset.

Ah, right. So a String.Index is actually something similar to

public struct Index {
    public var encodedOffset: Int
    private var byteOffset: Int // UTF-8 offset into the UTF-8 code unit
}

Similar. I'd write it this way:

public struct Index {
   public var encodedOffset: Int

   // Offset into a UnicodeScalar represented in an encoding other
   // than the String's underlying encoding
   private var transcodedOffset: Int
}

In this case, can't we still define String.Index comparison as merely
being the lexicographical comparison of (encodedOffset, byteOffset)?

Yes, and that's how it's implemented in the PR. But byteOffset is not
part of the user model, so we can't specify it that way.

Also, as a side note, the proposal implies that encodedOffset is
mutable. Is this actually the case? If so, I assume that mutating it
would also reset the byteOffset?

Yes,

     i.encodedOffset = n

is equivalent to

     i = String.Index(encodedOffset: n)
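To make the two points above concrete, here is a toy model of such an index. It is a sketch under stated assumptions: `SketchIndex` is a hypothetical stand-in written for this illustration, not the standard library's String.Index, and `transcodedOffset` is the conceptual field named above rather than public API.

```swift
// Hypothetical model of the index layout discussed above; not the stdlib's type.
struct SketchIndex: Comparable {
    // Assigning a new encodedOffset discards any transcoded position,
    // mirroring `i = String.Index(encodedOffset: n)`.
    var encodedOffset: Int {
        didSet { transcodedOffset = 0 }
    }
    private(set) var transcodedOffset: Int

    init(encodedOffset: Int, transcodedOffset: Int = 0) {
        self.encodedOffset = encodedOffset
        self.transcodedOffset = transcodedOffset
    }

    // Lexicographic comparison of (encodedOffset, transcodedOffset).
    static func < (lhs: SketchIndex, rhs: SketchIndex) -> Bool {
        return (lhs.encodedOffset, lhs.transcodedOffset) < (rhs.encodedOffset, rhs.transcodedOffset)
    }
}

var i = SketchIndex(encodedOffset: 2, transcodedOffset: 1)
let j = SketchIndex(encodedOffset: 2)
print(j < i)   // true: same encodedOffset, smaller transcodedOffset

i.encodedOffset = 5
print(i == SketchIndex(encodedOffset: 5))   // true: transcodedOffset was reset
```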

···

on Fri Jun 09 2017, Kevin Ballard <swift-evolution@swift.org> wrote:

On Tue, Jun 6, 2017, at 10:57 AM, Dave Abrahams via swift-evolution wrote:

on Mon Jun 05 2017, Kevin Ballard <swift-evolution@swift.org> wrote:
     
--
-Dave


(Xiaodi Wu) #9

I’m coming to this conversation rather late, so forgive the naive question:

Your proposal claims that current code with failable APIs is needlessly
awkward and that most code only interchanges indices that are known to
succeed. So, why is it not simply a precondition of string slicing that the
index be correctly aligned? It seems like this would simplify the behavior
greatly.

···

On Tue, Jun 13, 2017 at 19:04 Dave Abrahams via swift-evolution < swift-evolution@swift.org> wrote:



(David Waite) #10

<snip>

Ah, right. So a String.Index is actually something similar to

public struct Index {
   public var encodedOffset: Int
   private var byteOffset: Int // UTF-8 offset into the UTF-8 code unit
}

Similar. I'd write it this way:

public struct Index {
  public var encodedOffset: Int

  // Offset into a UnicodeScalar represented in an encoding other
  // than the String's underlying encoding
  private var transcodedOffset: Int
}

I *think* the following is what the proposal is saying, but let me walk through it:

My understanding would be:
- An index manipulated at the string level points to the start of a grapheme cluster which is also a particular code point and to a code unit of the underlying string backing data
- The unicodeScalar view can be intra-grapheme cluster, pointing at a code point
- The UTF-16 index can be intra-codepoint, since some code points are represented by two code units
- The UTF-8 index can be intra-codepoint as well, since code points are represented by up to four code units

So is the idea of the Index struct that the encodedOffset is an offset in the native representation of the string (byte offset, word offset, etc) to the start of a grapheme, and transcodedOffset is data for Unicode Scalar, UTF-16 and UTF-8 views to represent an offset within a grapheme to a code point or code unit?

My feeling is that ‘encoded’ is not enough to distinguish whether encodedOffset is meant to indicate an offset in graphemes, code points, or code units, or to specify that an index to the same character in two normalized strings may be different if one is backed by UTF-8 and the other UTF-16. “encodedCharacterOffset” may be better.

This index struct does limit some sorts of imagined string implementations, such as a string maintained piecewise across multiple allocation units or strings using a stateful character encoding like ISO/IEC 2022.

-DW

P.S. I’m also curious why the methods are optional failing vs retaining the current API and having them fatal error.

···

On Jun 9, 2017, at 9:24 PM, Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:

on Fri Jun 09 2017, Kevin Ballard <swift-evolution@swift.org> wrote:

On Tue, Jun 6, 2017, at 10:57 AM, Dave Abrahams via swift-evolution wrote:


(Dave Abrahams) #11

Well, consider the case raised by Kevin Ballard if nothing else: that code would start trapping.

-Dave

···

On Jun 13, 2017, at 6:16 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

I’m coming to this conversation rather late, so forgive the naive question:

Your proposal claims that current code with failable APIs is needlessly awkward and that most code only interchanges indices that are known to succeed. So, why is it not simply a precondition of string slicing that the index be correctly aligned? It seems like this would simplify the behavior greatly.


(Dave Abrahams) #12

<snip>

Ah, right. So a String.Index is actually something similar to

public struct Index {
   public var encodedOffset: Int

   private var byteOffset: Int // UTF-8 offset into the UTF-8 code unit
}

Similar. I'd write it this way:

public struct Index {
  public var encodedOffset: Int

  // Offset into a UnicodeScalar represented in an encoding other
  // than the String's underlying encoding
  private var transcodedOffset: Int
}

I *think* the following is what the proposal is saying, but let me
walk through it:

OK. I'm going to be extremely nitpicky about terminology just to ensure
complete clarity; please don't take it as criticism.

My understanding would be:
- An index manipulated at the string level points to the start of a
grapheme cluster which is also a particular code point

* A grapheme cluster is not a code point

* Probably you mean that it also points to the start of a code point

* We try not to say “code point” because

  a) despite its loose and liberal use in the Unicode standard,
     according to Unicode experts that term technically means something
     having specifically to do with UTF-16 (IIRC the space of code
     points includes surrogate values), and while it was the same thing
     as a Unicode scalar value in the days of UCS-2, is mostly not a
     useful concept today.

  b) the potential for confusion between “code unit” and “code point” is
     huge; people mix them up all the time.

  c) Instead we use “Unicode scalar value” or “Unicode scalar” for
     short; my advice is to banish the term “code point” from your
     vocabulary as I have, except when picking nits ;-)

and to a code unit of the underlying string backing data

Yes. If String indices were Hashable, then these would all be true:

    Set(s.indices).isSubset(of: s.unicodeScalars.indices)
    Set(s.unicodeScalars.indices).isSubset(of: s.utf16.indices)
    Set(s.unicodeScalars.indices).isSubset(of: s.utf8.indices)

(the views also all have the same endIndex)
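String.Index has since gained a Hashable conformance, so these relations can be checked directly. A minimal sketch, assuming a current Swift 5 toolchain; the sample string and the names `characterIndices`, `scalarIndices`, and `ends` are arbitrary choices for illustration.

```swift
// One combining sequence, one non-BMP scalar, one ASCII character.
let s = "e\u{301}\u{1F600}!"

let characterIndices = Set(s.indices)
let scalarIndices = Set(s.unicodeScalars.indices)

// Every Character boundary is a scalar boundary, and every scalar
// boundary is a code unit boundary in both encoded views.
print(characterIndices.isSubset(of: scalarIndices))       // true
print(scalarIndices.isSubset(of: Set(s.utf16.indices)))   // true
print(scalarIndices.isSubset(of: Set(s.utf8.indices)))    // true

// All four views share one endIndex.
let ends = [s.endIndex, s.unicodeScalars.endIndex, s.utf16.endIndex, s.utf8.endIndex]
print(Set(ends).count == 1)                                // true
```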

Today, the code units are utf16. If we lift that restriction and add a
codeUnits view, then

    Set(s.indices).isSubset(of: s.codeUnits.indices)

- The unicodeScalar view can be intra-grapheme cluster, pointing at a
code point

I don't follow, sorry. I think the unicodeScalar view doesn't point at
anything.

- The UTF-16 index can be intra-codepoint, since some code points are
represented by two code units
- The UTF-8 index can be intra-codepoint as well, since code points are
represented by up to four code units

if we s/codepoint/unicode scalar/, then yes.
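As a quick illustration of why indices into the encoded views can fall inside a Unicode scalar, the element counts of the four views can be compared for the same string. This is a sketch assuming a current Swift 5 toolchain; `unitCounts` is a hypothetical helper written for this example.

```swift
// Counts the elements of each view of the same string.
func unitCounts(_ s: String) -> (characters: Int, scalars: Int, utf16: Int, utf8: Int) {
    return (s.count, s.unicodeScalars.count, s.utf16.count, s.utf8.count)
}

let accented = unitCounts("e\u{301}")   // "e" + combining acute accent
let emoji = unitCounts("\u{1F600}")     // a single non-BMP scalar

// 1 Character, 2 scalars, 2 UTF-16 code units, 3 UTF-8 code units
print(accented)
// 1 Character, 1 scalar, 2 UTF-16 code units, 4 UTF-8 code units
print(emoji)
```

Wherever a view has more elements than the unicodeScalars view, some of its indices necessarily sit between scalar boundaries.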

So is the idea of the Index struct that the encodedOffset is an
offset in the native representation of the string (byte offset, word
offset, etc) to the start of a grapheme, and transcodedOffset is data
for Unicode Scalar, UTF-16 and UTF-8 views to represent an offset
within a grapheme to a code point or code unit?

Almost. First, remember that transcodedOffset is currently just a
conceptual thing and not part of the proposed API. But if we exposed
it, the following would be true:

  s.indices.index(where: { $0.transcodedOffset != 0 }) == nil
  s.unicodeScalars.indices.index(where: { $0.transcodedOffset != 0 }) == nil

and, because the native encoding of Strings is currently always UTF-16 compatible

  s.utf16.indices.index(where: { $0.transcodedOffset != 0 }) == nil

In other words, a non-zero transcodedOffset can only occur in indices
from views that represent the string as code units in something other
than its native encoding, and only if that view is not UTF-32.

My feeling is that ‘encoded’ is not enough to distinguish whether
encodedOffset is meant to indicate an offset in graphemes, code
points, or code units,

IMO if you know Unicode, it does, because **Unicode encoding** is
specifically about *representation* in terms of code units. The
question, then, is whether it's confusing for people who know Unicode
less well, and whether that actually matters. My supposition has been
that, when all the right high-level APIs are in place, most people will
never touch encodedOffset(s). But I could be wrong.

The best alternative I can come up with is “nativeCodeUnitOffset,” which
is a mouthful. We can't just use “codeUnitOffset” because, for example,
in the utf8 view of today's UTF-16-encoded string, this is not about
counting UTF-8 code units; it's still about UTF-16 code units.

or to specify that an index to the same character in two normalized
strings may be different if one is backed by UTF-8 and the other
UTF-16. “encodedCharacterOffset” may be better.

In what way does bringing the word “Character” into this improve things?

This index struct does limit some sorts of imagined string
implementations, such as a string maintained piecewise across multiple
allocation units

I'm pretty certain it does not rule out such an implementation. It was
designed to allow that.

or strings using a stateful character encoding like ISO/IEC 2022.

I don't believe it prevents that either. The index already has state to
avoid repeating work when in a loop such as:

   var i = someView.startIndex
   while i != someView.endIndex {
      somethingWith(someView[i]) // 1
      i = someView.index(after: i) // 2
   }

where lines 1 and 2 both require determining the extent of the element
in underlying code units. There's no reason it couldn't acquire
additional state.

The most efficient way to deal with a String in a particular encoding is
to make a new instance of StringProtocol (say ISO_IEC_2022String), which
would not have to use this index type.

It is planned that eventually String could actually use something like
ISO_IEC_2022String as its backing store. At that point, we'd have a
choice:

1. Allow String.Index to store arbitrary state, burdening it with the
   cost of potential ARC traffic, or

2. Create a limited “scratch space” using fundamental types (e.g., one
   UInt) that every instance of StringProtocol would have to be able to
   use to represent its state.

P.S. I’m also curious why the methods are optional failing vs
retaining the current API and having them fatal error.

Swift 3 has APIs like this:

   extension String.UnicodeScalarView.Index {
     func samePositionIn(_:String.UTF16View) -> String.UTF16View.Index
   }
   extension String.UTF8View.Index {
     func samePositionIn(_:String.UTF16View) -> String.UTF16View.Index?
   }

When String.UnicodeScalarView.Index and String.UTF8View.Index become the
same type (namely String.Index), you're left with:

   extension String.Index {
     func samePositionIn(_:String.UTF16View) -> String.UTF16View.Index
     func samePositionIn(_:String.UTF16View) -> String.UTF16View.Index?
   }

If you leave these two overloads in place, you break code because

   let x = i.samePositionIn(s.utf16)

is now ambiguous. The only way to keep code functioning is to have
these APIs return optionals.
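
The ambiguity can be reproduced in isolation with a toy type. `Position` and `samePosition()` below are hypothetical stand-ins, not the real standard library declarations; the commented-out line is the one that, per the explanation above, the compiler cannot resolve.

```swift
// Hypothetical stand-in for an index type with two overloads that differ
// only in the optionality of the return type.
struct Position {
    func samePosition() -> Int { return 0 }
    func samePosition() -> Int? { return nil }
}

let p = Position()

// let x = p.samePosition()   // no type context: reported above as ambiguous

// An explicit type annotation selects one overload.
let a: Int = p.samePosition()
let b: Int? = p.samePosition()
print(a, b as Any)
```

Existing call sites do not carry such annotations, which is why keeping both overloads would break them.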

Hope this helps,

···

on Mon Jun 12 2017, David Waite <swift-evolution@swift.org> wrote:

On Jun 9, 2017, at 9:24 PM, Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:
on Fri Jun 09 2017, Kevin Ballard <swift-evolution@swift.org> wrote:

On Tue, Jun 6, 2017, at 10:57 AM, Dave Abrahams via swift-evolution wrote:

--
-Dave


(Xiaodi Wu) #13

If we leave aside for a moment the nomenclature issue where everything in
Foundation referring to a character is really referring to a Unicode
scalar, Kevin’s example illustrates the whole problem in a nutshell,
doesn’t it? In that example, we have a straightforward attempt to slice
with a misaligned index. The totality of options here are:

* return nil, an option the rejection of which is the premise of your
proposal
* return a partial character (i.e., \u{301}), an option which we haven’t
yet talked about in this thread–seems like this could have simpler
semantics, potentially yields garbage if the index is garbage but in the
case of Kevin’s example actually behaves as the user might expect
* return a whole character after “rounding down”–difficult semantics to
define and explain, always results in a whole character but in the case of
Kevin’s example gives an unexpected answer
* returns a whole character after “rounding up”–difficult semantics to
define and explain, always results in a whole character but when the index
is misaligned would result in a character or range of characters in which
the index is not found
* trap–simple semantics, never returns garbage, obvious disadvantage that
execution will not proceed

No clearly perfect answer here. However, _if_ we hew strictly to the stated
premise of your proposal that failable APIs are awkward enough to justify a
change, and moreover that the awkwardness is truly “needless” because of
the rarity of misaligned index usage, then at face value trapping should be
a perfectly acceptable solution.

That Kevin’s example raises the specter of trapping being a realistic
occurrence in currently working code actually suggests a challenge to your
stated premise. If we accept that this challenge is a substantial one, then
it’s not clear to me that abandoning failable APIs should be ruled out from
the outset.

However, if this desire to remove failable APIs remains strong then I
wonder if the undiscussed second option above is worth at least some
consideration.
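
For reference, the first option (returning nil) is what the failable conversion `samePosition(in:)` does today when asked to align a scalar index with Character boundaries. A minimal sketch using the string from Kevin's example:

```swift
let str = "e\u{301}galite\u{301}"
let scalars = str.unicodeScalars

// Index of the combining acute accent: valid in the unicodeScalars view,
// but in the middle of the first Character.
let accent = scalars.index(after: scalars.startIndex)

if let aligned = accent.samePosition(in: str) {
    print("aligned:", str[aligned])
} else {
    print("misaligned with respect to Characters")
}
```

The caller is forced to handle the misaligned case, which is the awkwardness the proposal wants to avoid when the index is known to be aligned.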

···

On Wed, Jun 14, 2017 at 08:49 Dave Abrahams <dabrahams@apple.com> wrote:

> On Jun 13, 2017, at 6:16 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:
>
> I’m coming to this conversation rather late, so forgive the naive
> question:
>
> Your proposal claims that current code with failable APIs is needlessly
> awkward and that most code only interchanges indices that are known to
> succeed. So, why is it not simply a precondition of string slicing that the
> index be correctly aligned? It seems like this would simplify the behavior
> greatly.

Well, consider the case raised by Kevin Ballard if nothing else: that code
would start trapping.

-Dave


(Dave Abrahams) #14

I think you're misunderstanding the motivation here. It's not so much
that I want to remove failable APIs as that I want to reduce overall API
surface area. The current index conversion APIs contribute 16
initializers and 16 methods to the overall size of the library.

···

on Wed Jun 14 2017, Xiaodi Wu <xiaodi.wu-AT-gmail.com> wrote:

If we leave aside for a moment the nomenclature issue where everything in
Foundation referring to a character is really referring to a Unicode
scalar, Kevin’s example illustrates the whole problem in a nutshell,
doesn’t it? In that example, we have a straightforward attempt to slice
with a misaligned index. The totality of options here are:

* return nil, an option the rejection of which is the premise of your
proposal
* return a partial character (i.e., \u{301}), an option which we haven’t
yet talked about in this thread–seems like this could have simpler
semantics, potentially yields garbage if the index is garbage but in the
case of Kevin’s example actually behaves as the user might expect
* return a whole character after “rounding down”–difficult semantics to
define and explain, always results in a whole character but in the case of
Kevin’s example gives an unexpected answer
* returns a whole character after “rounding up”–difficult semantics to
define and explain, always results in a whole character but when the index
is misaligned would result in a character or range of characters in which
the index is not found
* trap–simple semantics, never returns garbage, obvious disadvantage that
execution will not proceed

No clearly perfect answer here. However, _if_ we hew strictly to the stated
premise of your proposal that failable APIs are awkward enough to justify a
change, and moreover that the awkwardness is truly “needless” because of
the rarity of misaligned index usage, then at face value trapping should be
a perfectly acceptable solution.

That Kevin’s example raises the specter of trapping being a realistic
occurrence in currently working code actually suggests a challenge to your
stated premise. If we accept that this challenge is a substantial one, then
it’s not clear to me that abandoning failable APIs should be ruled out from
the outset.

However, if this desire to remove failable APIs remains strong then I
wonder if the undiscussed second option above is worth at least some
consideration.

--
-Dave


(Dave Abrahams) #15

Well, yeah, and impossible. Collection conformance requires that
subscript return a non-optional Element.

···

on Wed Jun 14 2017, Xiaodi Wu <xiaodi.wu-AT-gmail.com> wrote:

On Wed, Jun 14, 2017 at 11:13 AM, Dave Abrahams <dabrahams@apple.com> wrote:

on Wed Jun 14 2017, Xiaodi Wu <xiaodi.wu-AT-gmail.com> wrote:

> However, if this desire to remove failable APIs remains strong then I
> wonder if the undiscussed second option above is worth at least some
> consideration.

I think you're misunderstanding the motivation here. It's not so much
that I want to remove failable APIs as that I want to reduce overall API
surface area. The current index conversion APIs contribute 16
initializers and 16 methods to the overall size of the library.

Ah, and presumably, having only failable APIs once these different index
types are collapsed into one would be too cumbersome.

--
-Dave


(David Waite) #16

<snipped>

So is the idea of the Index struct that the encodedOffset is an
offset in the native representation of the string (byte offset, word
offset, etc) to the start of a grapheme, and transcodedOffset is data
for Unicode Scalar, UTF-16 and UTF-8 views to represent an offset
within a grapheme to a code point or code unit?

Almost. First, remember that transcodedOffset is currently just a
conceptual thing and not part of the proposed API. But if we exposed
it, the following would be true:

s.indices.index(where: { $0.transcodedOffset != 0 }) == nil
s.unicodeScalars.indices.index(where: { $0.transcodedOffset != 0 }) == nil

and, because the native encoding of Strings is currently always UTF-16 compatible

s.utf16.indices.index(where: { $0.transcodedOffset != 0 }) == nil

In other words, a non-zero transcodedOffset can only occur in indices
from views that represent the string as code units in something other
than its native encoding, and only if that view is not UTF-32.

My main misconception appears to be that the implementation would track the beginning of a grapheme as an offset of code units, with additional tracking of the offset within a grapheme to a code unit or of state during transcoding. This would allow an index to track if it is misaligned with regard to the string, to make translations of indexes safer.

Thinking about this more, it would cause creating an index from an encodedOffset or incrementing an index to be a potentially O(n) operation as it walks the string tracking grapheme clusters.

or to specify that an index to the same character in two normalized
strings may be different if one is backed by UTF-8 and the other
UTF-16. “encodedCharacterOffset” may be better.

In what way does bringing the word “Character” into this improve things?

It doesn’t; it is based on my misconception above :-)

or strings using a stateful character encoding like ISO/IEC 2022.

I don't believe it prevents that either. The index already has state to
avoid repeating work when in a loop such as:

  var i = someView.startIndex
  while i != someView.endIndex {
     somethingWith(someView[i]) // 1
     i = someView.index(after: i) // 2
  }

where lines 1 and 2 both require determining the extent of the element
in underlying code units. There's no reason it couldn't acquire
additional state.

The most efficient way to deal with a String in a particular encoding is
to make a new instance of StringProtocol (say ISO_IEC_2022String), which
would not have to use this index type.

It is planned that eventually String could actually use something like
ISO_IEC_2022String as its backing store. At that point, we'd have a
choice:

1. Allow String.Index to store arbitrary state, burdening it with the
  cost of potential ARC traffic, or

2. Create a limited “scratch space” using fundamental types (e.g., one
  UInt) that every instance of StringProtocol would have to be able to
  use to represent its state.

Yes, this is what I was thinking, the Index becomes more complex as the # of types the system is leveraging the Index for state grows.

-DW

···

On Jun 13, 2017, at 3:21 PM, Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:
on Mon Jun 12 2017, David Waite <swift-evolution@swift.org> wrote:


(Xiaodi Wu) #17

Ah, and presumably, having only failable APIs once these different index
types are collapsed into one would be too cumbersome.

···

On Wed, Jun 14, 2017 at 11:13 AM, Dave Abrahams <dabrahams@apple.com> wrote:

on Wed Jun 14 2017, Xiaodi Wu <xiaodi.wu-AT-gmail.com> wrote:

> If we leave aside for a moment the nomenclature issue where everything in
> Foundation referring to a character is really referring to a Unicode
> scalar, Kevin’s example illustrates the whole problem in a nutshell,
> doesn’t it? In that example, we have a straightforward attempt to slice
> with a misaligned index. The totality of options here are:
>
> * return nil, an option the rejection of which is the premise of your
> proposal
> * return a partial character (i.e., \u{301}), an option which we haven’t
> yet talked about in this thread–seems like this could have simpler
> semantics, potentially yields garbage if the index is garbage but in the
> case of Kevin’s example actually behaves as the user might expect
> * return a whole character after “rounding down”–difficult semantics to
> define and explain, always results in a whole character but in the case of
> Kevin’s example gives an unexpected answer
> * returns a whole character after “rounding up”–difficult semantics to
> define and explain, always results in a whole character but when the index
> is misaligned would result in a character or range of characters in which
> the index is not found
> * trap–simple semantics, never returns garbage, obvious disadvantage that
> execution will not proceed
>
> No clearly perfect answer here. However, _if_ we hew strictly to the stated
> premise of your proposal that failable APIs are awkward enough to justify a
> change, and moreover that the awkwardness is truly “needless” because of
> the rarity of misaligned index usage, then at face value trapping should be
> a perfectly acceptable solution.
>
> That Kevin’s example raises the specter of trapping being a realistic
> occurrence in currently working code actually suggests a challenge to your
> stated premise. If we accept that this challenge is a substantial one, then
> it’s not clear to me that abandoning failable APIs should be ruled out from
> the outset.
>
> However, if this desire to remove failable APIs remains strong then I
> wonder if the undiscussed second option above is worth at least some
> consideration.

I think you're misunderstanding the motivation here. It's not so much
that I want to remove failable APIs as that I want to reduce overall API
surface area. The current index conversion APIs contribute 16
initializers and 16 methods to the overall size of the library.


(Xiaodi Wu) #18

If we leave aside for a moment the nomenclature issue where everything in
Foundation referring to a character is really referring to a Unicode
scalar, Kevin’s example illustrates the whole problem in a nutshell,
doesn’t it? In that example, we have a straightforward attempt to slice
with a misaligned index. The totality of options here are:

* return nil, an option the rejection of which is the premise of your
proposal
* return a partial character (i.e., \u{301}), an option which we haven’t
yet talked about in this thread–seems like this could have simpler
semantics, potentially yields garbage if the index is garbage but in the
case of Kevin’s example actually behaves as the user might expect
* return a whole character after “rounding down”–difficult semantics to
define and explain, always results in a whole character but in the case of
Kevin’s example gives an unexpected answer
* returns a whole character after “rounding up”–difficult semantics to
define and explain, always results in a whole character but when the index
is misaligned would result in a character or range of characters in which
the index is not found
* trap–simple semantics, never returns garbage, obvious disadvantage that
execution will not proceed

No clearly perfect answer here. However, _if_ we hew strictly to the
stated premise of your proposal that failable APIs are awkward enough to
justify a change, and moreover that the awkwardness is truly “needless”
because of the rarity of misaligned index usage, then at face value
trapping should be a perfectly acceptable solution.

That Kevin’s example raises the specter of trapping being a realistic
occurrence in currently working code actually suggests a challenge to your
stated premise. If we accept that this challenge is a substantial one, then
it’s not clear to me that abandoning failable APIs should be ruled out from
the outset.

However, if this desire to remove failable APIs remains strong then I
wonder if the undiscussed second option above is worth at least some
consideration.

Having digested your revised proposed behavior a little better I see you’re
kind of getting at this exact issue, but I’m uncomfortable with how it’s so
tied to the underlying encoding, which is not guaranteed to be UTF-16 but
is assumed to be for the purposes of slicing. I’d like to propose an
alternative that attempts to deliver on what I’ve called the second option
above–somewhat similar:

A string index will notionally or actually keep track of the view in which
it was originally aligned, be it utf8, utf16, unicodeScalars, or
characters. A slicing operation str.xxx[idx] will behave as expected if idx
is not misaligned with respect to str.xxx. If it is misaligned, the
operation would instead be notionally String(str.yyy[idx...]).xxx.first!,
where yyy is the original view in which idx was known aligned–if idx is not
also misaligned with respect to str.yyy (as might be the case if idx was
returned from an operation on a different string). If it is still
misaligned, trap.
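
A rough sketch of the fallback described above, taking yyy to be unicodeScalars and xxx to be the Character view. It only illustrates the notional expression String(str.yyy[idx...]).xxx.first!; the bookkeeping that would remember which view an index came from is not modelled here.

```swift
let str = "e\u{301}galite\u{301}"
let scalars = str.unicodeScalars

// idx is aligned in unicodeScalars but falls inside the first Character.
let idx = scalars.index(after: scalars.startIndex)

// Rebuild a string from the scalars starting at idx, then take its first Character.
var tail = String.UnicodeScalarView()
tail.append(contentsOf: scalars[idx...])
let fallback = String(tail).first!

// The result is the lone combining accent, i.e. the "partial character".
print(fallback.unicodeScalars.map { $0.value })
```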

···

On Wed, Jun 14, 2017 at 09:26 Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

On Wed, Jun 14, 2017 at 08:49 Dave Abrahams <dabrahams@apple.com> wrote:

> On Jun 13, 2017, at 6:16 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:
>
> I’m coming to this conversation rather late, so forgive the naive
> question:
>
> Your proposal claims that current code with failable APIs is needlessly
> awkward and that most code only interchanges indices that are known to
> succeed. So, why is it not simply a precondition of string slicing that the
> index be correctly aligned? It seems like this would simplify the behavior
> greatly.

Well, consider the case raised by Kevin Ballard if nothing else: that
code would start trapping.

-Dave


(Dave Abrahams) #19

If we leave aside for a moment the nomenclature issue where everything in
Foundation referring to a character is really referring to a Unicode
scalar, Kevin’s example illustrates the whole problem in a nutshell,
doesn’t it? In that example, we have a straightforward attempt to slice
with a misaligned index. The totality of options here are:

* return nil, an option the rejection of which is the premise of your
proposal
* return a partial character (i.e., \u{301}), an option which we haven’t
yet talked about in this thread–seems like this could have simpler
semantics, potentially yields garbage if the index is garbage but in the
case of Kevin’s example actually behaves as the user might expect

I think that's exactly what I was proposing in
https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170612/037466.html

* return a whole character after “rounding down”–difficult semantics
to define and explain, always results in a whole character but in the
case of Kevin’s example gives an unexpected answer
* returns a whole character after “rounding up”–difficult semantics to
define and explain, always results in a whole character but when the
index is misaligned would result in a character or range of characters
in which the index is not found
* trap–simple semantics, never returns garbage, obvious disadvantage
that execution will not proceed

No clearly perfect answer here. However, _if_ we hew strictly to the
stated premise of your proposal that failable APIs are awkward enough to
justify a change, and moreover that the awkwardness is truly “needless”
because of the rarity of misaligned index usage, then at face value
trapping should be a perfectly acceptable solution.

That Kevin’s example raises the specter of trapping being a realistic
occurrence in currently working code actually suggests a challenge to your
stated premise. If we accept that this challenge is a substantial one, then
it’s not clear to me that abandoning failable APIs should be ruled out from
the outset.

However, if this desire to remove failable APIs remains strong then I
wonder if the undiscussed second option above is worth at least some
consideration.

Having digested your revised proposed behavior a little better I see you’re
kind of getting at this exact issue, but I’m uncomfortable with how it’s so
tied to the underlying encoding, which is not guaranteed to be UTF-16 but
is assumed to be for the purposes of slicing.

I think there's some confusion here; probably I have failed to explain
myself. Today a String happens to always be UTF-16, but there's no
intention to assume that it is UTF-16 for the purposes of slicing in the
future. Any place you see something like s.utf16 in an example I've
used to illustrate semantics should be interpreted as a s.codeUnits,
where codeUnits is a collection of code units for whatever the
underlying encoding is.

Tying this to underlying encoding actually reflects the true nature of
String, which is exposed by the semantics of concatenation and range
replacement (where multiple elements may merge into one element). As
stated in
https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again
the elements of a String (or any of its views other than native code
units) are an emergent property. To anyone operating at Unicode scalar
granularity (which can result in misalignment with respect to
characters) or at the higher granularity of code units (native or
transcoded, which can result in misalignment with all other views), I
think this is actually unsurprising.
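
The merging behavior under concatenation is easy to observe directly. A small sketch:

```swift
let base = "e"
let accent = "\u{301}"          // combining acute accent on its own
let combined = base + accent

// Each piece is one Character by itself, yet the concatenation is also
// one Character: the two elements merged.
print(base.count, accent.count, combined.count)

// The lower-level views do not merge; their counts simply add up.
print(combined.unicodeScalars.count, combined.utf16.count)
```

An index that pointed at the start of `accent` before concatenation now points into the middle of a Character of `combined`, which is the kind of misalignment under discussion.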

I’d like to propose an alternative that attempts to deliver on what
I’ve called the second option above–somewhat similar:

A string index will notionally or actually keep track of the view in which
it was originally aligned, be it utf8, utf16, unicodeScalars, or
characters. A slicing operation str.xxx[idx] will behave as expected if idx
is not misaligned with respect to str.xxx. If it is misaligned, the
operation would instead be notionally String(str.yyy[idx...]).xxx.first!,
where yyy is the original view in which idx was known aligned–if idx is not
also misaligned with respect to str.yyy (as might be the case if idx was
returned from an operation on a different string). If it is still
misaligned, trap.

That seems much more complicated than what I'm proposing, but maybe
that's because I haven't yet explained myself clearly enough.

···

on Wed Jun 14 2017, Xiaodi Wu <xiaodi.wu-AT-gmail.com> wrote:

On Wed, Jun 14, 2017 at 09:26 Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

--
-Dave


(Dave Abrahams) #20

>
>> If we leave aside for a moment the nomenclature issue where everything in
>> Foundation referring to a character is really referring to a Unicode
>> scalar, Kevin’s example illustrates the whole problem in a nutshell,
>> doesn’t it? In that example, we have a straightforward attempt to slice
>> with a misaligned index. The totality of options here are:
>>
>> * return nil, an option the rejection of which is the premise of your
>> proposal
>> * return a partial character (i.e., \u{301}), an option which we haven’t
>> yet talked about in this thread–seems like this could have simpler
>> semantics, potentially yields garbage if the index is garbage but in the
>> case of Kevin’s example actually behaves as the user might expect

I think that's exactly what I was proposing in
https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170612/037466.html

>> * return a whole character after “rounding down”–difficult semantics
>> to define and explain, always results in a whole character but in the
>> case of Kevin’s example gives an unexpected answer
>> * returns a whole character after “rounding up”–difficult semantics to
>> define and explain, always results in a whole character but when the
>> index is misaligned would result in a character or range of characters
>> in which the index is not found
>> * trap–simple semantics, never returns garbage, obvious disadvantage
>> that execution will not proceed
>>
>> No clearly perfect answer here. However, _if_ we hew strictly to the
>> stated premise of your proposal that failable APIs are awkward enough to
>> justify a change, and moreover that the awkwardness is truly “needless”
>> because of the rarity of misaligned index usage, then at face value
>> trapping should be a perfectly acceptable solution.
>>
>> That Kevin’s example raises the specter of trapping being a realistic
>> occurrence in currently working code actually suggests a challenge to your
>> stated premise. If we accept that this challenge is a substantial one, then
>> it’s not clear to me that abandoning failable APIs should be ruled out from
>> the outset.
>>
>> However, if this desire to remove failable APIs remains strong then I
>> wonder if the undiscussed second option above is worth at least some
>> consideration.
>>
>
> Having digested your revised proposed behavior a little better I see you’re
> kind of getting at this exact issue, but I’m uncomfortable with how it’s so
> tied to the underlying encoding, which is not guaranteed to be UTF-16 but
> is assumed to be for the purposes of slicing.

I think there's some confusion here; probably I have failed to explain
myself. Today a String happens to always be UTF-16, but there's no
intention to assume that it is UTF-16 for the purposes of slicing in the
future. Any place you see something like s.utf16 in an example I've
used to illustrate semantics should be interpreted as a s.codeUnits,
where codeUnits is a collection of code units for whatever the
underlying encoding is.

Tying this to underlying encoding actually reflects the true nature of
String, which is exposed by the semantics of concatenation and range
replacement (where multiple elements may merge into one element). As
stated in
https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again
the elements of a String (or any of its views other than native code
units) are an emergent property. To anyone operating at Unicode scalar
granularity (which can result in misalignment with respect to
characters) or at the higher granularity of code units (native or
transcoded, which can result in misalignment with all other views), I
think this is actually unsurprising.

That's fair. If this is critical to the semantics, though, and you expect
that some people will operate at that granularity, it seems incongruous
that s.codeUnits isn't actually exposed to the user even if it'd be as a
type-erased AnyCollection.

I agree. Exposing .codeUnits is part of the longer-term plan, but I'm
trying to keep mostly-orthogonal issues out of this proposal.

> I’d like to propose an alternative that attempts to deliver on what
> I’ve called the second option above–somewhat similar:
>
> A string index will notionally or actually keep track of the view
> in which it was originally aligned, be it utf8, utf16,
> unicodeScalars, or characters. A slicing operation str.xxx[idx]
> will behave as expected if idx is not misaligned with respect to
> str.xxx. If it is misaligned, the operation would instead be
> notionally String(str.yyy[idx...]).xxx.first!, where yyy is the
> original view in which idx was known aligned–if idx is not also
> misaligned with respect to str.yyy (as might be the case if idx was
> returned from an operation on a different string). If it is still
> misaligned, trap.

That seems much more complicated than what I'm proposing, but maybe
that's because I haven't yet explained myself clearly enough.

I think I catch your drift, and I'm converging on your way of thinking
here.

:-)

···

on Wed Jun 14 2017, Xiaodi Wu <xiaodi.wu-AT-gmail.com> wrote:

On Wed, Jun 14, 2017 at 12:01 PM, Dave Abrahams <dabrahams@apple.com> wrote:

on Wed Jun 14 2017, Xiaodi Wu <xiaodi.wu-AT-gmail.com> wrote:
> On Wed, Jun 14, 2017 at 09:26 Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

--
-Dave