SE-0180: String Index Overhaul

>
>> If we leave aside for a moment the nomenclature issue where everything in
>> Foundation referring to a character is really referring to a Unicode
>> scalar, Kevin’s example illustrates the whole problem in a nutshell,
>> doesn’t it? In that example, we have a straightforward attempt to slice
>> with a misaligned index. The totality of options here are:
>>
>> * return nil, an option the rejection of which is the premise of your
>> proposal
>> * return a partial character (i.e., \u{301}), an option which we haven’t
>> yet talked about in this thread–seems like this could have simpler
>> semantics, potentially yields garbage if the index is garbage but in the
>> case of Kevin’s example actually behaves as the user might expect

I think that's exactly what I was proposing in
The swift-evolution Archives
Week-of-Mon-20170612/037466.html
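
For readers following along, here is a minimal sketch of what a misaligned index looks like and what the "partial character" option would hand back. The string is an assumption standing in for Kevin's example, which is not reproduced in this message, and the tail is built from the scalar view so the sketch does not depend on how a misaligned Character subscript behaves.

```swift
// Illustrative string (an assumption, not Kevin's actual example):
// "cafe" followed by U+0301 COMBINING ACUTE ACCENT.
let s = "cafe\u{301}"

// Position of the combining accent in the unicode-scalar view.
let accent: Unicode.Scalar = "\u{301}"
let idx = s.unicodeScalars.firstIndex(of: accent)!

// The index is aligned for the unicodeScalars view...
let scalarAtIdx = s.unicodeScalars[idx]

// ...but it is not one of the Character boundaries of `s`.
let isCharacterAligned = Array(s.indices).contains(idx)

// The "partial character" option corresponds to the text from idx onward.
var tail = String.UnicodeScalarView()
tail.append(contentsOf: s.unicodeScalars[idx...])
let partial = String(tail)
```

Here `partial` holds only the combining accent, which is the \u{301} result described in the second option.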

>> * return a whole character after “rounding down”–difficult semantics
>> to define and explain, always results in a whole character but in the
>> case of Kevin’s example gives an unexpected answer
>> * return a whole character after “rounding up”–difficult semantics to
>> define and explain, always results in a whole character but when the
>> index is misaligned would result in a character or range of characters
>> in which the index is not found
>> * trap–simple semantics, never returns garbage, obvious disadvantage
>> that execution will not proceed
>>
>> No clearly perfect answer here. However, _if_ we hew strictly to the
>> stated premise of your proposal that failable APIs are awkward enough to
>> justify a change, and moreover that the awkwardness is truly “needless”
>> because of the rarity of misaligned index usage, then at face value
>> trapping should be a perfectly acceptable solution.
>>
>> That Kevin’s example raises the specter of trapping being a realistic
>> occurrence in currently working code actually suggests a challenge to your
>> stated premise. If we accept that this challenge is a substantial one, then
>> it’s not clear to me that abandoning failable APIs should be ruled out from
>> the outset.
>>
>> However, if this desire to remove failable APIs remains strong then I
>> wonder if the undiscussed second option above is worth at least some
>> consideration.
>>
>
> Having digested your revised proposed behavior a little better I see you’re
> kind of getting at this exact issue, but I’m uncomfortable with how it’s so
> tied to the underlying encoding, which is not guaranteed to be UTF-16 but
> is assumed to be for the purposes of slicing.

I think there's some confusion here; probably I have failed to explain
myself. Today a String happens to always be UTF-16, but there's no
intention to assume that it is UTF-16 for the purposes of slicing in the
future. Any place you see something like s.utf16 in an example I've
used to illustrate semantics should be interpreted as s.codeUnits,
where codeUnits is a collection of code units for whatever the
underlying encoding is.
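
As a rough illustration of why the choice of code unit matters: `codeUnits` as described above is notional and is not a property this sketch can call, so it uses the existing `utf8` and `utf16` views as two possible stand-ins for "the underlying encoding". The sample string is an assumption.

```swift
// Illustrative string: "cafe" followed by U+0301 COMBINING ACUTE ACCENT.
let s = "cafe\u{301}"

// The same text as 8-bit and as 16-bit code units.
let utf8Units = Array(s.utf8)    // U+0301 takes two code units here
let utf16Units = Array(s.utf16)  // U+0301 takes one code unit here

// The two views disagree on how many positions the string has,
// so an index that is aligned in one need not be aligned in the other.
let sameLength = utf8Units.count == utf16Units.count
```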

Tying this to underlying encoding actually reflects the true nature of
String, which is exposed by the semantics of concatenation and range
replacement (where multiple elements may merge into one element). As
stated in
https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again
the elements of a String (or any of its views other than native code
units) are an emergent property. To anyone operating at Unicode scalar
granularity (which can result in misalignment with respect to
characters) or at the higher granularity of code units (native or
transcoded, which can result in misalignment with all other views), I
think this is actually unsurprising.
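
A small sketch of the merging behavior mentioned above, using only standard String operations; the sample values are assumptions chosen for illustration.

```swift
let base = "e"
let mark = "\u{301}"      // a lone combining acute accent
let joined = base + mark  // concatenation

// Each operand is one Character on its own, yet the result is also
// one Character: the two elements merged into one.
let countsBefore = base.count + mark.count
let countAfter = joined.count

// Range replacement shows the same effect: inserting the accent at the
// end of "cafe" leaves the Character count unchanged.
var word = "cafe"
let lengthBefore = word.count
word.replaceSubrange(word.endIndex..<word.endIndex, with: mark)
let lengthAfter = word.count
```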

That's fair. If this is critical to the semantics, though, and you expect
that some people will operate at that granularity, it seems incongruous
that s.codeUnits isn't actually exposed to the user even if it'd be as a
type-erased AnyCollection.

> I’d like to propose an alternative that attempts to deliver on what
> I’ve called the second option above–somewhat similar:
>
> A string index will notionally or actually keep track of the view in which
> it was originally aligned, be it utf8, utf16, unicodeScalars, or
> characters. A slicing operation str.xxx[idx] will behave as expected if idx
> is not misaligned with respect to str.xxx. If it is misaligned, the
> operation would instead be notionally String(str.yyy[idx...]).xxx.first!,
> where yyy is the original view in which idx was known aligned–if idx is not
> also misaligned with respect to str.yyy (as might be the case if idx was
> returned from an operation on a different string). If it is still
> misaligned, trap.
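
To make the notional expression concrete, here is a hedged sketch with xxx fixed to the character view and yyy fixed to unicodeScalars. `characterAtOrPartial` is a hypothetical helper, not a standard-library API, and it does not track the index's original view as the alternative would; it simply assumes idx is scalar-aligned in str and is not endIndex.

```swift
// Hypothetical helper illustrating String(str.yyy[idx...]).xxx.first!
// with yyy = unicodeScalars and xxx = characters.
func characterAtOrPartial(_ str: String, at idx: String.Index) -> Character {
    // Aligned case: idx is a Character boundary, so subscript normally.
    // (Linear scan; acceptable for a sketch.)
    if Array(str.indices).contains(idx) {
        return str[idx]
    }
    // Misaligned case: rebuild the tail from the scalar view and take
    // its first Character. Traps if the tail is empty.
    var tail = String.UnicodeScalarView()
    tail.append(contentsOf: str.unicodeScalars[idx...])
    return String(tail).first!
}

// Illustrative usage (the string is an assumption).
let s = "cafe\u{301}"
let accent: Unicode.Scalar = "\u{301}"
let idx = s.unicodeScalars.firstIndex(of: accent)!
let piece = characterAtOrPartial(s, at: idx)
let first = characterAtOrPartial(s, at: s.startIndex)
```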

That seems much more complicated than what I'm proposing, but maybe
that's because I haven't yet explained myself clearly enough.

I think I catch your drift, and I'm converging on your way of thinking here.


On Wed, Jun 14, 2017 at 12:01 PM, Dave Abrahams <dabrahams@apple.com> wrote:

on Wed Jun 14 2017, Xiaodi Wu <xiaodi.wu-AT-gmail.com> wrote:
> On Wed, Jun 14, 2017 at 09:26 Xiaodi Wu <xiaodi.wu@gmail.com> wrote: