InternalString class for easy String manipulation

hexdreamer · August 17, 2016, 9:04pm

Looking at the String reference again, I see that Swift.String is subscriptable. Also, I was able to write my “split” function without using subscripting at all:

import Foundation

public extension String {
    /// Splits the receiver around each match of the regular expression `pattern`.
    func split(_ pattern: String) -> [String] {
        var results = [String]()
        var remaining = startIndex..<endIndex
        // Search only the part of the string that has not been consumed yet.
        // Stop on an empty match, which would never advance `remaining`.
        while let matchRange = range(of: pattern, options: .regularExpression, range: remaining),
              !matchRange.isEmpty {
            results.append(String(self[remaining.lowerBound..<matchRange.lowerBound]))
            remaining = matchRange.upperBound..<endIndex
        }
        results.append(String(self[remaining]))
        return results
    }
}
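
For reference, subscripting a String directly looks roughly like the sketch below. It is only an illustration written against current Swift, not code from the thread: the sample text and the names `greeting`, `third`, and `firstWord` are made up, and the subscript takes a `String.Index` (or a range of them) rather than an `Int`.

```swift
// Sketch: String is subscriptable, but by String.Index rather than Int.
let greeting = "h\u{E9}llo, w\u{F6}rld"   // "héllo, wörld"

// Walk two Characters in from the start to obtain an index.
let third = greeting.index(greeting.startIndex, offsetBy: 2)
let thirdCharacter = greeting[third]      // a single Character

// A Range<String.Index> slices out a Substring; copy it into a String to keep it.
let comma = greeting.firstIndex(of: ",") ?? greeting.endIndex
let firstWord = String(greeting[greeting.startIndex..<comma])
```

The indexes come from the string itself (`startIndex`, `firstIndex(of:)`, `range(of:)`), which is why the split function above never needs an integer.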

So it seems I’ve painted myself into a corner...

-Kenny

On Aug 17, 2016, at 1:34 PM, Kenny Leung via swift-evolution <swift-evolution@swift.org> wrote:

William Sumner says:
Can you be more specific about the improvements you’d like to see? Based on an earlier message, you want to be able to use subscripting on strings to retrieve visual glyphs, but you can do this now via the .characters property, which presents a view of the string’s contents as a collection of extended grapheme clusters.

I did not know about .characters. I would say this addresses the glyph portion of my issues.

I still have a problem with not being able to index using simple integers to create subscripts or ranges. I wrote my own “split” function, and found it extremely frustrating to have to work with .Index types instead of being able to use integers for subscripts and ranges. Compared to other languages, this almost obviates the usefulness of subscripts altogether. I understand that there are performance implications with translating integer subscripts into actual indexes in the string, but I guess this is a case where even generating another view on the string doesn’t do enough (internally) to make it worthwhile. Perhaps if it did… Again, this is very beginner unfriendly. I guess I will amend my definition of beginner to not only include people new to programming, but people already experienced in languages besides Swift. Now that I think about it, NSString is as powerful as Swift.String when it comes to Unicode, and it still allows integer-based indexing.

Another issue I have is that a String itself is not subscriptable, but that you have to get a view of it. I think String should have some default subscriptability that “does the right thing”, whatever that is decided to be.

<heart-to-heart on>
Now that we’re getting to the heart of the problem (thanks for prompting me to think more deeply about it), Swift may be more frustrating to learn for experienced programmers coming from C, Objective-C, Java, Ruby, etc. You try to do the simplest thing, like indexing into a string, and then find out you can’t. You think to yourself, “I’ve been programming in Objective-C for 20 years. Why can’t I do this? Am I stupid? Is the Swift team purposely trying to make this hard for me?”

I’ve been reading swift-evolution for a long time now, and a reason often given for design decisions is “term of art”. I believe that integer-based subscriptability is a term of art that should be supported.
<heart-to-heart off>

On Aug 17, 2016, at 12:51 PM, Shawn Erickson <shawnce@gmail.com> wrote:

I would also like to understand the perceived problem for first time programmers. To me first time programmers would be working with string literals ("hello world"), string literals with values in them ("Hello \(name)"), doing basic string concat, using higher level API of string to do and find things in a string, etc.

I guess it’s a matter of opinion what features beginner programmers will dip their toes into, but I think string manipulation is not that far up the totem pole. Would you consider splitting a comma-separated string an advanced task?

Also, see my revised definition of beginner programmer above.

-Kenny

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

William_Sumner · August 17, 2016, 8:57pm

Note that working with individual characters of a NSString can be unsafe because a visual glyph may be represented by multiple characters. NSString provides methods like rangeOfComposedCharacterSequencesForRange: to enable you to align your character retrievals along grapheme cluster boundaries.
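
A small sketch of that mismatch, assuming Foundation is available. The sample string and the variable names are illustrative; `rangeOfComposedCharacterSequence(at:)` is the Swift spelling of the single-index variant of the NSString method family mentioned above.

```swift
import Foundation

// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
let accented = "e\u{301}"
let bridged = NSString(string: accented)

let utf16Units = bridged.length    // NSString counts UTF-16 code units
let characters = accented.count    // Swift counts extended grapheme clusters

// Ask NSString for the whole composed sequence that covers code unit 0.
let composed = bridged.rangeOfComposedCharacterSequence(at: 0)
```

Here one visible glyph spans two NSString "characters", so slicing at integer offset 1 would separate the accent from its base letter.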

You may be interested in this article by Mike Ash, which gives a rationale for the String API, including why indexes aren't simple integers: mikeash.com: Friday Q&A 2015-11-06: Why is Swift's String API So Hard?

In short, these are not simple accesses but potentially expensive operations, and integer subscripting could give users the assumption that they’re accessing arrays with no performance overhead.
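
To make the hidden cost concrete, here is a rough sketch of what an integer-based accessor has to do. `character(atOffset:)` is a hypothetical helper written for this illustration; it is not part of the standard library and not something proposed in this thread.

```swift
// Hypothetical helper, not part of the standard library.
extension String {
    /// Returns the Character at `offset`, or nil if the offset is out of bounds.
    /// Every call walks forward from startIndex, so it costs O(offset), not O(1).
    func character(atOffset offset: Int) -> Character? {
        guard offset >= 0,
              let i = index(startIndex, offsetBy: offset, limitedBy: endIndex),
              i < endIndex else { return nil }
        return self[i]
    }
}

let word = "na\u{EF}ve"   // "naïve"

// Reads like array access, but each lookup restarts the walk from the beginning,
// so the loop as a whole does quadratic work in the length of `word`.
var collected = [Character]()
for n in 0..<word.count {
    if let c = word.character(atOffset: n) { collected.append(c) }
}
```

Spelling the walk out as `index(_:offsetBy:)` keeps that cost visible at the call site, which is the trade-off described above.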

Preston

hexdreamer · August 17, 2016, 10:03pm

Thanks for the pointer.

I guess being told *why* the String API is so hard doesn’t make me feel much better about the fact that it *is* hard.

It opens:

“One of the biggest complaints I see from people using Swift is the String API. It's difficult and obtuse, and people often wish it were more like string APIs in other languages.”

It’s been said on the list that they are thinking about rewriting the String at some point. I’m hoping that the API can be made simpler.

-Kenny

xwu · August 17, 2016, 10:24pm

>> You may be interested in this article by Mike Ash, which gives a rationale for the String API, including why indexes aren't simple integers: https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-swifts-string-api-so-hard.html
>
> Thanks for the pointer.
>
> I guess being told *why* the String API is so hard doesn’t make me feel much better about the fact that it *is* hard.
>
> It opens:
>
> “One of the biggest complaints I see from people using Swift is the String API. It's difficult and obtuse, and people often wish it were more like string APIs in other languages.”
>
> It’s been said on the list that they are thinking about rewriting the String at some point. I’m hoping that the API can be made simpler.

I too am excited to see what improvements may come.

That said, the Swift String APIs are far and away *the best* string APIs
I've ever worked with, precisely because they promote _correct_ code in so
many ways that alternative "simpler" APIs don't. It's exactly this learning
process, where you learn that index-based slicing of NSString is unsafe and
that traversing Strings character-by-character is computationally
expensive, and then you find that you don't need to use either the unsafe
or the expensive methods after all, that reveals the power of the design.
As a result, you now have a Unicode-ready *and* performant slice algorithm.
The fact that you've been guided to this end result by the API design is
precisely what makes me appreciate it so much!

Shawn_Erickson · August 17, 2016, 7:20pm

As stated earlier it is 2016, I think the baseline should be robust Unicode
support and what we have in Swift is actually a fairly good way of dealing
with it IMHO. I think new to development folks should have this as their
baseline as well... not that we shouldn't make it as easy to work with as
possible.

-Shawn

On Wed, Aug 17, 2016 at 12:15 PM Kenny Leung via swift-evolution <swift-evolution@swift.org> wrote:

It seems to me that UTF-8 is the best choice to encode strings in English
and English-like character sets for storage, but it’s not clear that it is
the most useful or performant internal representation for working with
strings. In my opinion, conflating the preferred storage format and the
best internal representation is not the proper thing to do. Picking the
right internal storage format should be evaluated based on its own
criteria. Even as an experienced programmer, I assert that the most useful
indexing system is glyph based.

In Félix’s case, I would expect to have to ask for a mail-friendly
representation of his name, just like you have to ask for a
filesystem-friendly representation of a filename regardless of what the
internal representation is. Just because you are using UTF-8 as the
internal format, it does not mean that universal support is guaranteed.

In response to this statement: “Optimizing developer experience for
beginning developers is just going to lead to software that screws…”, the
current system trips up not only beginning developers, but is different
from pretty much every programming language in my experience.

-Kenny

> On Aug 17, 2016, at 11:48 AM, Zach Waldowski via swift-evolution <swift-evolution@swift.org> wrote:
>
> It's 2016, "the thing people would most commonly expect" is
> impossible-to-screw-up Unicode support that's performant. Optimizing
> developer experience for beginning developers is just going to lead to
> software that screws up in situations the developer doesn't anticipate,
> as Félix notes above.
>
> Zachary
>
> On Wed, Aug 17, 2016, at 09:40 AM, Kenny Leung via swift-evolution wrote:
>> I understand that the most friendly approach may not be the most
>> efficient, but that’s not what I’m pushing for. I’m pushing for “does the
>> thing people would most commonly expect”. Take a first-time programmer
>> who reads any (human) language, and that is what they would expect.
>>
>> Why couldn’t String’s internal storage format be glyph-based? If I were,
>> say, writing a text editor, it would certainly be the easiest and most
>> efficient format to work in.
>>
>> -Kenny
>>
>>
>>> On Aug 15, 2016, at 9:20 PM, Félix Cloutier <felixcca@yahoo.ca> wrote:
>>>
>>> The major problem with this approach is that visual glyphs themselves have one level of variable-length encoding, and they sit on top of another variable-length encoding used to represent the Unicode characters (Swift-native Strings are currently encoded as UTF-8). For instance, the visual glyph 🇺🇸 is the result of putting side-by-side the Unicode characters 🇺 and 🇸 ("REGIONAL INDICATOR SYMBOL LETTER U" and "REGIONAL INDICATOR SYMBOL LETTER S"), which are themselves encoded as UTF-8 using 4 bytes each. A design in which you can "just write" string[4544] hides the fact that indexing is a linear-time operation that needs to recompose UTF-8 characters and then recompose visual glyphs on top of that.
>>>
>>> Generally speaking, I *think* that I agree that human-geared "long string" on which you probably won't need random access, and machine-geared smaller strings that encode a command, could benefit from not being considered the same fundamental thing. However, I'm also afraid that this will end with more applications and websites that think that first names only contain 7-bit-clean characters in the A-Z range. (I live in the US and I can attest that this is still very common.)
>>>
>>> You could make a point too that better facilities to parse strings would probably address this issue.
>>>
>>> Félix
>>>
>>>> On Aug 15, 2016, at 10:52:02, Kenny Leung via swift-evolution <swift-evolution@swift.org> wrote:
>>>>
>>>> I agree with both points of view. I think we need to bring back subscripting on strings which does the thing people would most commonly expect.
>>>>
>>>> I would say that the subscript indexes should correspond to a visual glyph. This seems reasonable to me for most character sets like Roman, Cyrillic, Chinese. There is some doubt in my mind for things like subscripted Japanese or connected (ligatured?) languages like Arabic, Hindi or Thai.
>>>>
>>>> -Kenny
>>>>
>>>>
>>>>> On Aug 15, 2016, at 10:42 AM, Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:
>>>>>
>>>>> On Sun, Aug 14, 2016 at 5:41 PM, Michael Savich via swift-evolution <swift-evolution@swift.org> wrote:
>>>>> Back in Swift 1.0, subscripting a String was easy, you could just use subscripting in a very Python like way. But now, things are a bit more complicated. I recognize why we need syntax like str.startIndex.advancedBy(x) but it has its downsides. Namely, it makes things hard on beginners. If one of Swift's goals is to make it a great first language, this syntax fights that. Imagine having to explain Unicode and character size to an 8 year old. This is doubly problematic because String manipulation is one of the first things new coders might want to do.
>>>>>
>>>>> What about having an InternalString subclass that only supports one encoding, allowing it to be subscripted with Ints? The idea is that an InternalString is for Strings that are more or less hard coded into the app. Dictionary keys, enum raw values, that kind of stuff. This also has the added benefit of forcing the programmer to think about what the String is being used for. Is it user facing? Or is it just for internal use? And of course, it makes code dealing with String manipulation much more concise and readable.
>>>>>
>>>>> It follows that something like this would need to be entered as a literal to make it as easy as using String. One way would be to make all String literals InternalStrings, but that sounds far too drastic. Maybe appending an exclamation point like "this"! Or even just wrapping the whole thing in exclamation marks like !"this"! Of course, we could go old school and write it like @"this" …That last one is a joke.
>>>>>
>>>>> I'll be the first to admit I'm way in over my head here, so I'm very open to suggestions and criticism. Thanks!
>>>>>
>>>>> I can sympathize, but this is tricky.
>>>>>
>>>>> Fundamentally, if it's going to be a learning and teaching issue, then this "easy" string should be the default. That is to say, if I write `var a = "Hello, world!"`, then `a` should be inferred to be of type InternalString or EasyString, whatever you want to call it.
>>>>>
>>>>> But, we also want Swift to support Unicode by default, and we want that support to do things The Right Way(TM) by default. In other words, a user should not have to reach for a special type in order to handle arbitrary strings correctly, and I should be able to reassign `a = "你好"` and have things work as expected. So, we also can't have the "easy" string type be the default...
>>>>>
>>>>> I can't think of a way to square that circle.
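
Félix's flag example from the quoted discussion can be checked directly. This is a minimal sketch against current Swift, with variable names chosen only for illustration; it shows the layers he describes, where one glyph is built from two scalars, each of which takes several code units.

```swift
// 🇺🇸 is REGIONAL INDICATOR SYMBOL LETTER U followed by LETTER S.
let flag = "\u{1F1FA}\u{1F1F8}"

let glyphCount = flag.count                   // 1 extended grapheme cluster (Character)
let scalarCount = flag.unicodeScalars.count   // 2 Unicode scalars
let utf16Count = flag.utf16.count             // 4 UTF-16 code units
let utf8Count = flag.utf8.count               // 8 UTF-8 bytes, 4 per scalar
```

Which of these four numbers an integer subscript should count in is exactly the question the thread keeps returning to.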

Felix_Cloutier1 · August 18, 2016, 5:40am

> In Félix’s case, I would expect to have to ask for a mail-friendly representation of his name, just like you have to ask for a filesystem-friendly representation of a filename regardless of what the internal representation is. Just because you are using UTF-8 as the internal format, it does not mean that universal support is guaranteed.

Would you imagine if "n" turned out to be poorly supported by systems throughout the world and dead-serious people argued that it's too hard for beginners?

"Filesystem-friendly" and "email-friendly" names are not backed by modern standards. You can have essentially any character that you like in a file name save for the directory separator on almost every platform out there (except on Windows, but the constraints are implemented in a layer above NTFS), and addresses like félix@... are RFC-legal. Restrictions are merely wished into existence by programmers who don't want to complicate their mental model of text processing, to everyone else's detriment.

Félix

hexdreamer · August 17, 2016, 7:27pm

> As stated earlier it is 2016

I don’t like the tone attached to this statement.

> I think the baseline should be robust Unicode support

I don’t understand how anything I have pushed for would compromise robust Unicode support.

> and what we have in Swift is actually a fairly good way of dealing with it IMHO. I think new to development folks should have this as their baseline as well… not that we shouldn't make it as easy to work with as possible.

Regardless of internal representation, wouldn’t this be a glyph-based indexing system?

-Kenny

On Aug 17, 2016, at 12:20 PM, Shawn Erickson <shawnce@gmail.com> wrote:

-Shawn

BigZaphod · August 17, 2016, 7:24pm

I’m not sure what the current year of the Gregorian calendar has to do with strings. :P

l8r
Sean

On Aug 17, 2016, at 2:20 PM, Shawn Erickson via swift-evolution <swift-evolution@swift.org> wrote:

As stated earlier it is 2016, I think the baseline should be robust Unicode support and what we have in Swift is actually a fairly good way of dealing with it IMHO. I think new to development folks should have this as their baseline as well... not that we shouldn't make it as easy to work with as possible.

-Shawn

On Wed, Aug 17, 2016 at 12:15 PM Kenny Leung via swift-evolution <swift-evolution@swift.org> wrote:
It seems to me that UTF-8 is the best choice to encode strings in English and English-like character sets for storage, but it’s not clear that it is the most useful or performant internal representation for working with strings. In my opinion, conflating the preferred storage format and the best internal representation is not the proper thing to do. Picking the right internal storage format should be evaluated based on its own criteria. Even as an experienced programmer, I assert that the most useful indexing system is glyph based.

In Félix’s case, I would expect to have to ask for a mail-friendly representation of his name, just like you have to ask for a filesystem-friendly representation of a filename regardless of what the internal representation is. Just because you are using UTF-8 as the internal format, it does not mean that universal support is guaranteed.

In response to this statement: “Optimizing developer experience for beginning developers is just going to lead to software that screws…”, the current system trips up not only beginning developers, but is different from pretty much every programming language in my experience.

-Kenny

> On Aug 17, 2016, at 11:48 AM, Zach Waldowski via swift-evolution <swift-evolution@swift.org> wrote:
>
> It's 2016, "the thing people would most commonly expect"
> impossible-to-screw-up Unicode support that's performance. Optimizing
> developer experience for beginning developers is just going to lead to
> software that screws up in situations the developer doesn't anticipate,
> as F+¬lix notes above.
>
> Zachary
>
> On Wed, Aug 17, 2016, at 09:40 AM, Kenny Leung via swift-evolution > > wrote:
>> I understand that the most friendly approach may not be the most
>> efficient, but that’s not what I’m pushing for. I’m pushing for "does the
>> thing people would most commonly expect”. Take a first-time programmer
>> who reads any (human) language, and that is what they would expect.
>>
>> Why couldn’t String’s internal storage format be glyph-based? If I were,
>> say, writing a text editor, it would certainly be the easiest and most
>> efficient format to work in.
>>
>> -Kenny
>>
>>
>>> On Aug 15, 2016, at 9:20 PM, Félix Cloutier <felixcca@yahoo.ca> wrote:
>>>
>>> The major problem with this approach is that visual glyphs themselves have one level of variable-length encoding, and they sit on top of another variable-length encoding used to represent the Unicode characters (Swift-native Strings are currently encoded as UTF-8). For instance, the visual glyph is the the result of putting side-by-side the Unicode characters 🇺 and 🇸("REGIONAL INDICATOR SYMBOL LETTER U" and "REGIONAL INDICATOR SYMBOL LETTER S"), which are themselves encoded as UTF-8 using 4 bytes each. A design in which you can "just write" string[4544] hides the fact that indexing is a linear-time operation that needs to recompose UTF-8 characters and then recompose visual glyphs on top of that.
>>>
>>> Generally speaking, I *think* that I agree that human-geared "long string" on which you probably won't need random access, and machine-geared smaller strings that encode a command, could benefit from not being considered the same fundamental thing. However, I'm also afraid that this will end with more applications and websites that think that first names only contain 7-bit-clean characters in the A-Z range. (I live in the US and I can attest that this is still very common.)
>>>
>>> You could make a point too that better facilities to parse strings would probably address this issue.
>>>
>>> Félix
>>>
>>>> Le 15 août 2016 à 10:52:02, Kenny Leung via swift-evolution <swift-evolution@swift.org> a écrit :
>>>>
>>>> I agree with both points of view. I think we need to bring back subscripting on strings which does the thing people would most commonly expect.
>>>>
>>>> I would say that the subscript indexes should correspond to a visual glyph. This seems reasonable to me for most character sets like Roman, Cyrillic, Chinese. There is some doubt in my mind for things like subscripted Japanese or connected (ligatured?) languages like Arabic, Hindi or Thai.
>>>>
>>>> -Kenny
>>>>
>>>>
>>>>> On Aug 15, 2016, at 10:42 AM, Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:
>>>>>
>>>>> On Sun, Aug 14, 2016 at 5:41 PM, Michael Savich via swift-evolution <swift-evolution@swift.org> wrote:
>>>>> Back in Swift 1.0, subscripting a String was easy: you could just use subscripting in a very Python-like way. But now, things are a bit more complicated. I recognize why we need syntax like str.startIndex.advancedBy(x) but it has its downsides. Namely, it makes things hard on beginners. If one of Swift's goals is to make it a great first language, this syntax fights that. Imagine having to explain Unicode and character size to an 8-year-old. This is doubly problematic because String manipulation is one of the first things new coders might want to do.
>>>>>
>>>>> What about having an InternalString subclass that only supports one encoding, allowing it to be subscripted with Ints? The idea is that an InternalString is for Strings that are more or less hard coded into the app. Dictionary keys, enum raw values, that kind of stuff. This also has the added benefit of forcing the programmer to think about what the String is being used for. Is it user facing? Or is it just for internal use? And of course, it makes code dealing with String manipulation much more concise and readable.
>>>>>
>>>>> It follows that something like this would need to be entered as a literal to make it as easy as using String. One way would be to make all String literals InternalStrings, but that sounds far too drastic. Maybe appending an exclamation point like "this"! Or even just wrapping the whole thing in exclamation marks like !"this"! Of course, we could go old school and write it like @"this" …That last one is a joke.
>>>>>
>>>>> I'll be the first to admit I'm way in over my head here, so I'm very open to suggestions and criticism. Thanks!
>>>>>
>>>>> I can sympathize, but this is tricky.
>>>>>
>>>>> Fundamentally, if it's going to be a learning and teaching issue, then this "easy" string should be the default. That is to say, if I write `var a = "Hello, world!"`, then `a` should be inferred to be of type InternalString or EasyString, whatever you want to call it.
>>>>>
>>>>> But, we also want Swift to support Unicode by default, and we want that support to do things The Right Way(TM) by default. In other words, a user should not have to reach for a special type in order to handle arbitrary strings correctly, and I should be able to reassign `a = "你好"` and have things work as expected. So, we also can't have the "easy" string type be the default...
>>>>>
>>>>> I can't think of a way to square that circle.
>>>>>
>>>>>
>>>>> Sent from my iPad
>>>>>
>>>>> _______________________________________________
>>>>> swift-evolution mailing list
>>>>> swift-evolution@swift.org
>>>>> https://lists.swift.org/mailman/listinfo/swift-evolution
>>>>>

Shawn_Erickson · August 17, 2016, 7:36pm

I like a "view" based system when looking at a Unicode string. It lets you
pick the view of string - defining how it is indexed - based on your needs.
A view could be indexed by a human facing glyph, a particular Unicode
encoding style, a decompose style, etc.

I think that is powerful, useful, and exposes the real complexity in a
manageable and functional way.

In many domains you would never need to care about indexing across a view
or even using a view to work with a string.
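
As a rough illustration of the view idea, here is a minimal sketch that counts the same string through four views. It assumes Swift 4 or later, where `String` is itself a collection of `Character` values; in the Swift 3 discussed in this thread the same view is reached through `.characters`. The sample string is made up for the example.

```swift
// One string seen through four views; each view defines its own "element".
// "cafe" + COMBINING ACUTE ACCENT (U+0301) + GRINNING FACE (U+1F600)
let sample = "cafe\u{301}\u{1F600}"

let graphemeCount = sample.count                 // extended grapheme clusters
let scalarCount   = sample.unicodeScalars.count  // Unicode scalar values
let utf16Count    = sample.utf16.count           // UTF-16 code units
let utf8Count     = sample.utf8.count            // UTF-8 code units (bytes)

print(graphemeCount, scalarCount, utf16Count, utf8Count)  // prints "5 6 7 10"
```

The four counts differ because the accented "e" is one grapheme built from two scalars, and the emoji is one scalar that needs two UTF-16 code units or four UTF-8 bytes.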

···

On Wed, Aug 17, 2016 at 12:27 PM Kenny Leung via swift-evolution <swift-evolution@swift.org> wrote:

>
> On Aug 17, 2016, at 12:20 PM, Shawn Erickson <shawnce@gmail.com> wrote:
>
> As stated earlier it is 2016

I don’t like the tone attached to this statement.

> I think the baseline should be robust Unicode support

I don’t understand how anything I have pushed for would compromise robust
Unicode support.

> and what we have in Swift is actually a fairly good way of dealing with
> it IMHO. I think new-to-development folks should have this as their
> baseline as well…

> not that we shouldn't make it as easy to work with as possible.

Regardless of internal representation, wouldn’t this be a glyph-based
indexing system?

-Kenny

>
> -Shawn
>
> On Wed, Aug 17, 2016 at 12:15 PM Kenny Leung via swift-evolution <swift-evolution@swift.org> wrote:
> It seems to me that UTF-8 is the best choice to encode strings in
> English and English-like character sets for storage, but it’s not clear
> that it is the most useful or performant internal representation for
> working with strings. In my opinion, conflating the preferred storage
> format and the best internal representation is not the proper thing to do.
> Picking the right internal storage format should be evaluated based on its
> own criteria. Even as an experienced programmer, I assert that the most
> useful indexing system is glyph based.
>
> In Félix’s case, I would expect to have to ask for a mail-friendly
> representation of his name, just like you have to ask for a
> filesystem-friendly representation of a filename regardless of what the
> internal representation is. Just because you are using UTF-8 as the
> internal format, it does not mean that universal support is guaranteed.
>
> In response to this statement: “Optimizing developer experience for
> beginning developers is just going to lead to software that screws…”, the
> current system trips up not only beginning developers, but is different
> from pretty much every programming language in my experience.
>
> -Kenny
>
>
> > On Aug 17, 2016, at 11:48 AM, Zach Waldowski via swift-evolution <swift-evolution@swift.org> wrote:
> >
> > It's 2016; "the thing people would most commonly expect" is
> > impossible-to-screw-up Unicode support that's performant. Optimizing
> > developer experience for beginning developers is just going to lead to
> > software that screws up in situations the developer doesn't anticipate,
> > as Félix notes above.
> >
> > Zachary

hexdreamer · August 18, 2016, 4:33pm

> Just because you are using UTF-8 as the internal format, it does not mean that universal support is guaranteed.

All I meant was this, and nothing more. If the internal format was UTF-8, and you were using a filesystem whose filenames were UTF-16, you would have the same problems.

-Kenny

···

On Aug 17, 2016, at 10:40 PM, Félix Cloutier <felixcca@yahoo.ca> wrote:

In Félix’s case, I would expect to have to ask for a mail-friendly representation of his name, just like you have to ask for a filesystem-friendly representation of a filename regardless of what the internal representation is. Just because you are using UTF-8 as the internal format, it does not mean that universal support is guaranteed.

Would you imagine if "n" turned out to be poorly supported by systems throughout the world and dead-serious people argued that it's too hard for beginners?

"Filesystem-friendly" and "email-friendly" names are not backed by modern standards. You can have essentially any character that you like in a file name save for the directory separator on almost every platform out there (except on Windows, but the constraints are implemented in a layer above NTFS), and addresses like félix@... are RFC-legal. Restrictions are merely wished into existence by programmers who don't want to complicate their mental model of text processing, to everyone else's detriment.

Félix

ahltorp · August 23, 2016, 12:26am

Also, until quite recently "filesystem-friendly" meant "only uppercase characters" and that only 8 (or on some systems only 6) characters could be used. Maybe these ASCII proponents want us to write everything in uppercase as well? And limit our identifiers to 6 characters. Now there's a proposal I can get behind!

FUNC HLOWRL(S: STRING) -> STRING {
RETURN "HELLO, WORLD: \(S)"
}

Or, to take your example with "n" not being supported ("m" is pretty close both phonetically and graphically):

FUMC HLOWRL(S: STRIMG) -> STRIMG {
RETURM "HELLO, WORLD: \(S)"
}

Still readable, right? And very easy for beginners.

/Magnus


Felix_Cloutier1 · August 18, 2016, 6:51pm

I'm not sure I understand your comment. UTF-8 and UTF-16 are just two different ways to represent Unicode data, and they can both encode the whole range of Unicode. Of course you'll have problems if you try to interpret UTF-8 as UTF-16 and vice-versa, but that'll do you regardless of whether you use international characters or not.

Félix

>> Just because you are using UTF-8 as the internal format, it does not mean that universal support is guaranteed.

All I meant was this, and nothing more. If the internal format was UTF-8, and you were using a filesystem whose filenames were UTF-16, you would have the same problems.

-Kenny


hexdreamer · August 18, 2016, 7:34pm

> Of course you'll have problems if you try to interpret UTF-8 as UTF-16 and vice-versa, but that'll do you regardless of whether you use international characters or not.

This is exactly my point. Even if the internal representation is UTF-8 (or UTF-16), you are not free from having to do conversions. You still need to convert to the encoding format that is understood by the receiver. I make a distinction between Unicode and Unicode encodings.

-Kenny


xwu · August 18, 2016, 7:41pm

Actually, if I'm not mistaken, String (or at least, CFStringRef, to which
String is toll-free bridged) does not re-encode anything eagerly. If you
initialize with UTF8 bytes, it's stored internally as UTF8 bytes; if you
initialize with UTF16 code units, it's stored internally as UTF16 code
units. Re-encoding happens only when necessary--i.e. when you ask for UTF8
bytes from a UTF16-encoded string.


Felix_Cloutier1 · August 19, 2016, 6:05am

Even UTF-32 does not provide a 1-to-1 mapping to visual glyphs. As mentioned earlier in this thread, for instance, flags are composed of two Unicode characters.
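
To make that concrete, here is a small sketch, assuming Swift 4 or later. It treats the Unicode scalar values as the UTF-32 code units (they are the same numbers) and compares their count to the number of user-perceived characters. The variable names are just for illustration.

```swift
// A flag is one visible glyph built from two regional indicator scalars.
let flag = "\u{1F1FA}\u{1F1F8}"  // REGIONAL INDICATOR SYMBOL LETTER U + LETTER S

// Each Unicode scalar value is exactly one UTF-32 code unit.
let utf32Units: [UInt32] = flag.unicodeScalars.map { $0.value }

let glyphCount = flag.count        // user-perceived characters
let unitCount  = utf32Units.count  // fixed-width UTF-32 code units

print(glyphCount, unitCount)  // prints "1 2"
```

So even with a fixed-width encoding, integer offsets into the code units would still land in the middle of a glyph here.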

Félix

···

Le 18 août 2016 à 12:22:10, Jean-Denis Muys <jdmuys@gmail.com> a écrit :

And both are variable-length encodings. I mean that different characters do
not necessarily occupy the same number of bytes in memory.

But now, UTF-32 (or UCS-4) is a constant-length encoding. Why not use
UTF-32 as the encoding for an easy to use and easy to index string type?
The memory inefficiency of it might be a small price to pay in many cases,
including for beginners.

Finally, I oppose restricting identifiers in Swift programs to ASCII chars
only. One reason is that in scientific programming, we at last can use
greek letters, or even: א.

Jean-Denis


Felix_Cloutier1 · August 18, 2016, 10:43pm

When I say "reinterpret", I mean taking the UTF-8 bytes and pretend that they're UTF-16. This is an extremely clear bug whenever it happens. The correct conversion between UTF-8 and UTF-16 is lossless.
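
A minimal sketch of the difference, assuming Swift 4 or later for `String(decoding:as:)`. The first half converts properly between the two encodings and gets the original text back; the second half deliberately commits the bug by pairing up UTF-8 bytes as if they were little-endian UTF-16 code units. The sample text and variable names are made up for the example.

```swift
let original = "f\u{E9}lix"  // "félix", with a precomposed é

// Correct conversion: decode the UTF-8 bytes, then re-encode as UTF-16.
let utf8Bytes: [UInt8] = Array(original.utf8)
let decoded = String(decoding: utf8Bytes, as: UTF8.self)
let utf16Units: [UInt16] = Array(decoded.utf16)
let roundTripped = String(decoding: utf16Units, as: UTF16.self)

// The bug: reinterpret pairs of UTF-8 bytes as UTF-16 code units
// (little-endian here; an odd trailing byte would simply be dropped).
let reinterpreted: [UInt16] = stride(from: 0, to: utf8Bytes.count - 1, by: 2).map {
    UInt16(utf8Bytes[$0]) | (UInt16(utf8Bytes[$0 + 1]) << 8)
}
let garbled = String(decoding: reinterpreted, as: UTF16.self)

print(roundTripped == original)  // prints "true"
print(garbled == original)       // prints "false"
```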

The vast majority of systems, including file systems and email addresses, support Unicode. I'm struggling to come up with an example where a restriction isn't the result of lazy assumptions. It's not like we have to pause and check that every link on the network path is 8-bit clean anymore.

Félix

Charles_Srstka · August 23, 2016, 2:53am

I wonder how possible it would be to make a string type that stored a table containing references to runs of text encoded using single code units, either as ranges, indexes of multi-code-units characters, or some kind of search tree or something so that you could quickly randomly access a character without having to parse the whole string up to that point. It would consume additional memory, but since most of the world’s most commonly used writing systems all fall into the Basic Multilingual Plane, characters represented by a single UTF-16 word should be the overwhelming majority in most cases, which might make it not completely unworkable.
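
For illustration only, here is a much simpler variant of that idea, assuming Swift 4 or later. Instead of storing only the runs or the multi-code-unit exceptions, it stores one `String.Index` per character, so it pays the full memory cost up front. `OffsetIndexedString` is a hypothetical name, not an existing type.

```swift
// Hypothetical wrapper: one linear pass at construction time builds a table
// of character positions, after which integer subscripting is a table lookup.
struct OffsetIndexedString {
    let base: String
    private let table: [String.Index]  // one entry per Character

    init(_ base: String) {
        self.base = base
        self.table = Array(base.indices)  // O(n) once, instead of on every access
    }

    var count: Int { return table.count }

    // Traps on an out-of-range offset, like Array.
    subscript(offset: Int) -> Character {
        return base[table[offset]]
    }
}

// "a", a flag, an "e" with a combining accent, then "z".
let indexed = OffsetIndexedString("a\u{1F1FA}\u{1F1F8}e\u{301}z")
print(indexed.count)  // prints "4"
print(indexed[3])     // prints "z"
```

A table like this goes stale as soon as the string is mutated, which is one reason a real design would likely need the run-based or tree-based bookkeeping described above.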

Charles
