Strings in Swift 4

Ted, that sort of implementation grows many common strings by a factor of 8 and makes some less common strings require multiple memory allocations. Considering that our research has shown it is a big performance and energy-use win to heroically compress <https://www.mikeash.com/pyblog/friday-qa-2012-07-27-lets-build-tagged-pointers.html> strings to avoid both kinds of bloat (plenty of actual data was gathered before tagged pointer strings were added to Cocoa), a scheme like the one you're proposing is pretty much a non-starter as far as I'm concerned.

···

Sent from my moss-covered three-handled family gradunza

On Feb 22, 2017, at 5:56 AM, Ted F.A. van Gaalen <tedvgiosdev@gmail.com> wrote:

Hi Ben,
thank you, yes, I know all that by now.

I have seen that one goes to great lengths to optimise, not only for storage but also for speed. But how far does this need to go? In any case, optimisation should not be used
as an argument for restricting a PL's functionality, that is, for refraining from PL elements which are common and useful.

I wouldn’t worry so much over storage (unless one wants to load a complete book into memory).
In iOS, the average app is about 15-50 MB, and String data is mostly a fraction of that. In macOS or similar I’d think it is even less significant.


I wonder how much performance and memory consumption would differ from the current contiguous memory implementation if a String were just a plain row of (references to) Character (extended grapheme cluster) objects, Array<Character>. That would simplify the basic logic and (sub)string handling significantly, because then one has direct access to the String’s elements, using the reasonably fast access methods of a Swift Collection/Array.

I have experimented with an alternative String struct based upon Array<Character>, seeing how easy it was to implement most popular string handling functions as one can work with the Character array directly.

Currently at deep-dive-depth in the standard lib sources, especially String & Co.

Kind Regards
TedvG

On 21 Feb 2017, at 01:31, Ben Cohen <ben_cohen@apple.com> wrote:

Hi Ted,

While Character is the Element type for String, it would be unsuitable for a String’s implementation to actually use Character for storage. Character is fairly large (currently 9 bytes), very little of which is used for most values. For unusual graphemes that require more storage, it allocates more memory on the heap. By contrast, String’s actual storage is a buffer of 1- or 2-byte elements, and all graphemes (what we expose as Characters) are held in that contiguous memory no matter how many code points they comprise. When you iterate over the string, the graphemes are unpacked into a Character on the fly. This gives you a user interface of a collection that superficially appears to resemble [Character], but this does not mean that this would be a workable implementation.
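
The difference between the stored code units and the Characters exposed on iteration can be observed directly. The following is a minimal sketch, assuming a current Swift toolchain rather than the Swift 3 of this thread; the size that `MemoryLayout<Character>.size` reports depends on the Swift version and may not match the 9 bytes quoted above, so it is printed rather than asserted.

```swift
// "é" written as e + combining acute accent, followed by a flag
// made of two regional-indicator scalars.
let s = "e\u{301}\u{1F1F3}\u{1F1F1}"

// In-memory size of one Character value; version dependent.
print("Character size:", MemoryLayout<Character>.size)

// The same text measured at different levels of abstraction.
print("Characters (grapheme clusters):", s.count)
print("Unicode scalars:", s.unicodeScalars.count)
print("UTF-16 code units:", s.utf16.count)
print("UTF-8 code units:", s.utf8.count)

// Iteration unpacks one grapheme cluster at a time into a Character.
for ch in s {
    print(ch, "is built from", ch.unicodeScalars.count, "scalar(s)")
}
```

Two Characters here cover four scalars and eleven UTF-8 bytes, which is why a fixed-size slot per Character would waste space for plain ASCII text and still need extra allocations for longer clusters.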

On Feb 20, 2017, at 12:59 PM, Ted F.A. van Gaalen <tedvgiosdev@gmail.com> wrote:

Hi Ben, Dave (you should not read this now, you’re on vacation :o) & Others

As described in the Swift Standard Library API Reference:

The Character type represents a character made up of one or more Unicode scalar values,
grouped by a Unicode boundary algorithm. Generally, a Character instance matches what
the reader of a string will perceive as a single character. The number of visible characters is
generally the most natural way to count the length of a string.
The smallest discrete unit we (app programmers) are mostly working with is this
perceived visible character, what else?

If that is the case, my reasoning is that Strings (could / should?) be relatively simple,
because most, if not all, complexity of Unicode is confined within the Character object and
completely hidden** from the average application programmer, who normally only needs
to work with Strings which contain these visible Characters, right?
It then makes no difference at all what is in the Character (excellent implementation btw)
(Unicode, ASCII, EBCDIC, Elvish, KlingonIV, IntergalacticV.2, whatever),
because we rely in sublime oblivion, for the visual representation of whatever is in
the Character, on miraculous font processors hidden in the dark depths of the OS.

Then, in this perspective, my question is: why is String not implemented as
directly based upon an array [Character]? In that case one can refer to the Characters of the
String directly, both for direct subscripting and for other String functionality, in an efficient way.
(I do have a scope of independent Swift here, that is, interaction with libraries should be
solved by the compiler, so as not to be restricted by legacy ObjC etc.)

** (except if one needs to e.g. access individual elements and/or compose graphics directly?
      But for this purpose the Character’s properties are accessible.)

For the sake of convenience, based upon the above reasoning, I now “emulate” this in
a string extension, thereby ignoring the rare cases that a visible character could be based
upon more than a single Character (extended grapheme cluster). If that would occur,
they should be merged into one extended grapheme cluster, a single Character that is.

//: Playground - implement direct subscripting using a Character array
// Of course, if String were defined as a directly accessible array of Characters,
// it would be more efficient than these extension functions.
extension String
{
    var count: Int
        {
        get
        {
            return self.characters.count
        }
    }

    subscript (n: Int) -> String
    {
        return String(Array(self.characters)[n])
    }
    
    subscript (r: Range<Int>) -> String
    {
        return String(Array(self.characters)[r])
    }
    
    subscript (r: ClosedRange<Int>) -> String
    {
        return String(Array(self.characters)[r])
    }
}

func test()
{
    let zoo = "Koala :koala:, Snail :snail:, Penguin :penguin:, Dromedary :dromedary_camel:"
    print("zoo has \(zoo.count) characters (discrete extended graphemes):")
    for i in 0..<zoo.count
    {
        print(i,zoo[i],separator: "=", terminator:" ")
    }
    print("\n")
    print(zoo[0..<7])
    print(zoo[9..<16])
    print(zoo[18...26])
    print(zoo[29...39])
    print("images:" + zoo[6] + zoo[15] + zoo[26] + zoo[39])
}

test()

this works as intended and generates the following output:

zoo has 40 characters (discrete extended graphemes):
0=K 1=o 2=a 3=l 4=a 5= 6=🐹 7=, 8= 9=S 10=n 11=a 12=i 13=l 14= 15=🐌 16=, 17=
18=P 19=e 20=n 21=g 22=u 23=i 24=n 25= 26=🐧 27=, 28= 29=D 30=r 31=o 32=m
33=e 34=d 35=a 36=r 37=y 38= 39=🐪

Koala :koala:
Snail :snail:
Penguin :penguin:
Dromedary :dromedary_camel:
images::koala::snail::penguin::dromedary_camel:

I don’t know how (in)efficient this method is,
but in many cases this is not as important as it is with e.g. numerical computation.

I still fail to understand why direct subscripting of strings would be unnecessary,
and would like to see this built into Swift asap.

Btw, I do share the concern as expressed by Rien regarding the increasing complexity of the language.

Kind Regards,

TedvG

Given that the behavior you desire is literally a few keystrokes away (see below), it would be unfortunate to pessimize the internal representation of Strings for every application. This would destroy the applicability of the Swift standard library to entire areas of computing such as application development for mobile devices (Swift's current largest niche). The idea of abstraction is that you can provide a high-level view of things stored at a lower level in accordance with sensible higher-level semantics and expectations. If you want random access, then you can eagerly project the characters (see below). This is consistent with the standard library’s preference for lazy sequences when providing an eager one would result in a large up-front cost that might be avoidable otherwise.

Here’s playground code that gives you what you’re requesting, by doing an eager projection (rather than a lazy one, which is the default):

extension String {
    var characterArray: [Character] {
        return characters.map { $0 }
    }
}
let str = "abcdefg\(UnicodeScalar(0x302)!)"
let charArray = str.characterArray
charArray[4] // results in "e"
charArray[6] // results in "ĝ"

Note that you get random access AND safety by operating at the Character level. If you operate at the unicode scalar value level instead, you might be splitting canonical combining sequences accidentally.
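
The risk of splitting a combining sequence can be made concrete. This is a small sketch, assuming a current Swift toolchain, that takes the same text as the playground example above and cuts it once at the Character level and once at the Unicode scalar level; the variable names are only illustrative.

```swift
// "abcdef" followed by g + U+0302 COMBINING CIRCUMFLEX ACCENT.
let str = "abcdefg\u{302}"

// Eager projections at two different levels.
let characters = Array(str)            // [Character]
let scalars = Array(str.unicodeScalars) // [Unicode.Scalar]

print(characters.count) // number of grapheme clusters
print(scalars.count)    // number of Unicode scalars

// Character level: the last element is the whole cluster "ĝ".
let lastCharacter = characters[6]

// Scalar level: keeping only the first 7 scalars silently drops the
// accent, leaving a plain "g" at the end.
var truncatedView = String.UnicodeScalarView()
truncatedView.append(contentsOf: scalars[0..<7])
let truncated = String(truncatedView)

print(lastCharacter, truncated)
```

Indexing by Character keeps "ĝ" intact, while the scalar-level cut produces "abcdefg" and loses the circumflex without any error.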

···

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

ok, I understand, thank you
TedvG

···

On 25 Feb 2017, at 00:25, David Sweeris <davesweeris@mac.com> wrote:

On Feb 24, 2017, at 13:41, Ted F.A. van Gaalen <tedvgiosdev@gmail.com> wrote:

Hi David & Dave

can you explain that in more detail?

Wouldn’t that turn simple character access into a mutating function?

Assigning like s[11...14] = str is of course mutating, yes.
Only then - that is, if the character array has thus been changed -
does it have to update the string in storage, yes.

But str = s[n..<m] doesn’t mutate.
So you’d have to keep a (private) isChanged: Bool or bit,
or a checksum over the character array?

It mutates because the String has to instantiate the Array<Character> into which you're indexing, if it doesn't already exist. It may not make any externally visible changes, but it's still a change.

- Dave Sweeris
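
The point that a read which fills a cache is still a mutation shows up in the language itself. The sketch below uses a hypothetical wrapper type, `CachingText`, which is not part of the standard library and not how String is implemented; it only assumes that a `lazy` stored property in a struct has a mutating getter, which is how current Swift behaves.

```swift
// Hypothetical wrapper, for illustration only.
struct CachingText {
    let storage: String

    // Built on first access. Filling it in writes to `self`,
    // so the getter of a lazy property is implicitly mutating.
    lazy var characters: [Character] = Array(storage)

    init(_ storage: String) {
        self.storage = storage
    }
}

// Works: `text` is a variable, so the hidden write is allowed.
var text = CachingText("h\u{E9}llo")
let second = text.characters[1]
print(second)

// Would not compile: a constant cannot be mutated, not even to
// populate a cache that is invisible from the outside.
// let fixed = CachingText("abc")
// let first = fixed.characters[0]
```

A cache hidden behind reference semantics would compile for constants too, but it would then need synchronisation to stay safe when the same value is read from several threads.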

Thank you Michael,
I did that already in this extension: (as written before)

extension String
{
    var count: Int
        {
        get
        {
            return self.characters.count
        }
    }

// stored properties are not possible in extensions
// var ar = Array(self.characters)

    subscript (n: Int) -> String
    {
        return String(Array(self.characters)[n])
    }
    
    subscript (r: Range<Int>) -> String
    {
        return String(Array(self.characters)[r])
    }
    
    subscript (r: ClosedRange<Int>) -> String
    {
        return String(Array(self.characters)[r])
    }
}

but this is not so efficient, because for each subscript invocation
the Character array must be built again (if not cached within String).
I assume it must be rebuilt each time because one cannot create new stored
properties in extensions (why not?), like a Character array as in the above comment.
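
For comparison, integer offsets can be resolved without building an array at all by walking String.Index values. The following is a sketch assuming a current Swift toolchain; `character(atOffset:)` and `substring(offsets:)` are hypothetical helper names, not standard library API. Each call still walks from the start of the string, so it is linear in the offset, but it allocates no temporary [Character].

```swift
extension String {
    // Hypothetical helper: walks `n` grapheme clusters from the start.
    // O(n) per call, but no intermediate array is created.
    func character(atOffset n: Int) -> Character {
        return self[index(startIndex, offsetBy: n)]
    }

    // Hypothetical helper: resolves both ends of an integer range
    // into String.Index values and copies out that slice.
    func substring(offsets r: Range<Int>) -> String {
        let lower = index(startIndex, offsetBy: r.lowerBound)
        let upper = index(lower, offsetBy: r.count)
        return String(self[lower..<upper])
    }
}

let zoo = "Koala \u{1F428}, Snail \u{1F40C}"
let koala = zoo.character(atOffset: 6)
let snail = zoo.substring(offsets: 9..<16)
print(koala, snail)

// When many lookups are needed, building the array once and reusing
// it avoids repeating the walk for every access.
let zooCharacters = Array(zoo)
let snailEmoji = zooCharacters[15]
print(snailEmoji)
```

As for the "why not?": an extension cannot add stored properties because that would change the memory layout of a type that is already defined elsewhere; computed properties are allowed.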
  

Given that the behavior you desire is literally a few keystrokes away (see below), it would be unfortunate to pessimize the internal representation of Strings for every application. This would destroy the applicability of the Swift standard library to entire areas of computing such as application development for mobile devices (Swift's current largest niche). The idea of abstraction is that you can provide a high-level view of things stored at a lower level in accordance with sensible higher-level semantics and expectations. If you want random access, then you can eagerly project the characters (see below). This is consistent with the standard library’s preference for lazy sequences when providing an eager one would result in a large up-front cost that might be avoidable otherwise.

mostly true.

Here’s playground code that gives you what you’re requesting, by doing an eager projection (rather than a lazy one, which is the default):

Your extension is more efficient than my subscript extension above,
because the Character array is drawn once from the String, instead of
the str.characters property being scanned again each time.
@Dave:
is that the case, or is the character view cached, so that it
doesn’t matter much if the characterView is retrieved frequently?

extension String {
    var characterArray: [Character] {
        return characters.map { $0 }
    }
}
let str = "abcdefg\(UnicodeScalar(0x302)!)"
let charArray = str.characterArray
charArray[4] // results in "e"
charArray[6] // results in “ĝ”

I would normally subclass String, but in Swift I can’t do this
because String is a struct, inheritance of structs is not
possible in Swift.

@Dave:
Thanks for the explanation and the link (it’s been a long time
ago reading about pointers, normally I try to avoid these things like the plague..)

Factor 8? that's a big storage difference.. Currently still diving into Swift stdlib,
maybe I’ll get some bright ideas there , but don’t count on it :o)

However, for the String struct, I have another suggestion/solution/question if I may:

If String’s CharacterView is not cached (or is it?) to prevent repetitive regeneration,
but even then:

What about having a (lazy) Array<Character> property inside String,
which:
      is normally nil and only created when parts of aString are
      accessed/changed, e.g. with subscripting;
      will be nil again when the String has changed;
      can also be disposed of (set to nil or emptied) upon request:
          str.disposeCharacterArray()
      or maybe:
          str.compactString()
          str.freeSpace()
Although then available as a property like this:
      str.characterArray,
normally one would not access this character array directly,
but rather implicitly with subscripting on the String itself, like str[n...m].
In that case, if it does not already exist, this character array inside String
will be created and remains alive until aString disappears, changes, or
the string’s character array is explicitly disposed of
(e.g. useful when many strings are involved, to free storage).

in that way:
No unnecessary storage is allocated for Character arrays,
but only when the need arises.
There are no longer performance based restrictions for the programmer
to subscript strings directly. Hooray!

Not only to *get* but also to *set* substrings.
(The latter would of course require String-internal
processing of the Character array, updating the
storage in the String.)

Furthermore, one could base nearly all
string handling like substring, replace, search, etc.
directly on this character array without the
need to walk through the contiguous String storage
itself each time at runtime.

Flexible! So one can do all this and more:
     str[5] = "x"
     let s = str[5]
     str[3...5] = "HAL"
     str[range] = str[range].reversed()
     var s = str[10..<28]
     if str[p1..<p1 + length] == "Dakota" {
     }
     notes[bar1..<bar1 + 6] = "EADGBE"
     etc.

   (try to do this with the existing string handling functions..)
   and also roll your own string handling functions directly
   based on subscripting, possibly in your own extensions.
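
The get and set subscripts wished for above can be tried in user code without touching String itself. The sketch below is not the lazy cache proposed in this message: it is a hypothetical wrapper type, `IndexedText`, that eagerly holds a [Character] and converts back to String on request. It assumes a current Swift toolchain, and all names in it are illustrative.

```swift
// Hypothetical wrapper: random access by integer offset, paid for
// with one [Character] array per value.
struct IndexedText {
    private(set) var characters: [Character]

    init(_ string: String) {
        characters = Array(string)
    }

    // Converts back to an ordinary String when needed.
    var string: String {
        return String(characters)
    }

    // Single Character access by integer offset.
    subscript(i: Int) -> Character {
        get { return characters[i] }
        set { characters[i] = newValue }
    }

    // Half-open range access; the replacement may differ in length.
    subscript(r: Range<Int>) -> String {
        get { return String(characters[r]) }
        set { characters.replaceSubrange(r, with: newValue) }
    }

    // Closed range access.
    subscript(r: ClosedRange<Int>) -> String {
        get { return String(characters[r]) }
        set { characters.replaceSubrange(r, with: newValue) }
    }
}

var greeting = IndexedText("Hello, world")
greeting[7...11] = "Swift"   // replace "world"
greeting[0] = "J"            // replace one Character
let head = greeting[0..<5]
greeting[0..<5] = String(greeting[0..<5].reversed())
print(head, greeting.string)
```

The cost is visible in the type: every `IndexedText` carries a full Character array, which is the storage overhead the replies in this thread argue should not be imposed on every String.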

In that way we can forget the imho -sorry, excuse l’moi- awkward and tedious constructions like:
  str.substringWithRange(Range<String.Index>(start: str.startIndex, end: str.endIndex))
horrible, too much typing, can’t read these things, have to look them up each time..
?
Kind Regards
TedvG
( I am Dutch and living in Germany (like being here but it doesn’t help my English much :o) )
www.tedvg.com <http://www.tedvg.com/>
www.ravelnotes.com <http://www.ravelnotes.com/>

···

On 22 Feb 2017, at 19:43, Michael Ilseman <milseman@apple.com> wrote:

Note that you get random access AND safety by operating at the Character level. If you operate at the unicode scalar value level instead, you might be splitting canonical combining sequences accidentally.

Ted,

It might have helped if instead of being called String and Character, they were named Text and ExtendedGraphemeCluster.

They don’t really have the same behavior or functionality as string/characters in many other languages, especially older languages. This is because in many languages, strings are not just text but also random-access (possibly binary) data.

That's not to say that there aren’t a ton of algorithms where you can use Text like a String, treat ExtendedGraphemeCluster like a character, and get Unicode behavior without thinking about it.

But when it comes to random access and/or byte modification, you are better off working with something closer to a traditional (byte) string interface.

Trying to wedge random access and byte modification into the Swift String will simply complicate everything, slow down the algorithms which don’t need it, eat up more memory, as well as slow down bridging between Swift and Objective C code.

Hence me suggesting earlier working with Data, [UInt8], or [Character] within the context of your manipulation code, then converting to a Swift String at the end. Convert to the data format you need, then convert back.

That's not to say that there aren’t features which would simplify/clarify algorithms working in this manner.

-DW
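
The convert, manipulate, convert back workflow suggested in this message can be sketched as follows, assuming a current Swift toolchain. The sample strings and variable names are only illustrative; the calls used (`Array(_:)`, `String(_:)` from a Character sequence, and `String(decoding:as:)`) are standard library API.

```swift
// Character-level editing: project to [Character], edit by integer
// index, then build a String again at the end.
let original = "na\u{EF}ve caf\u{E9}"   // "naïve café"
var chars = Array(original)
chars[2] = "i"                          // replace "ï"
chars[chars.count - 1] = "e"            // replace "é"
let edited = String(chars)

// Byte-level editing: project to [UInt8] (UTF-8), edit, then decode.
// Only safe when the edit keeps the bytes valid UTF-8, as with ASCII.
var bytes = Array("swift".utf8)
bytes[0] = UInt8(ascii: "S")
let fromBytes = String(decoding: bytes, as: UTF8.self)

print(edited, fromBytes)
```

Both projections cost one pass over the string, paid once, after which every indexed access is constant time.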

···

On Feb 24, 2017, at 4:27 PM, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org> wrote:

ok, I understand, thank you
TedvG

Equally a non-starter. All known threadsafe schemes that require caches to be updated upon non-mutating operations have horrible performance issues, and further this would penalize all string code by reserving space for the cache and filling it even for the vast majority of operations that don't require random access. Trust me, we've gotten lots of such suggestions and thought through the implications of each one very carefully. I'm afraid you will have to accept being disappointed about this.

More generally, there's a reason that the collection model has bidirectional and random access distinctions: important data structures are inherently not random access. Heroic attempts to present the illusion that they are randomly-accessible are not going to fly. These abstractions always break down, leaking the true non-random-access nature in often unpredictable ways, penalizing lots of code for the sake of a very few use-cases, and introducing complexity that is hard for the optimizer to digest and makes it painful (sometimes impossible) to grow and evolve the library.

This should be seen as a general design philosophy: Swift presents abstractions that harmonize with, rather than hide, the true nature of things.

···

From me, the answer remains "no."

Sent from my moss-covered three-handled family gradunza

On Feb 22, 2017, at 1:40 PM, Ted F.A. van Gaalen <tedvgiosdev@gmail.com> wrote:

What about having a (lazy) Array<Character> property inside String?

Hi David W.
please read inline responses

Ted,

It might have helped if instead of being called String and Character, they were named Text and ExtendedGraphemeCluster.

Imho, “Text” maybe, but in computer programming “String” is more appropriate, I think. See:
String (computer science) - Wikipedia <https://en.wikipedia.org/wiki/String_(computer_science)>

Also imho, “Character” is OK (but maybe “Symbol” would be better), because mostly, when working
with text/strings in an application, it is not important to know how the Character is encoded,
e.g. Unicode, ASCII, whatever. (OOP: please hide the details, thank you.)

However, if I needed to work with the character’s components directly, e.g. when I
might want to influence the display of
the underlying graphical aspects, I always have access to the Character’s properties
and methods: Unicode code points, ASCII bytes, whatever it contains...
   

They don’t really have the same behavior or functionality as strings/characters in many other languages, especially older languages. This is because in many languages, strings are not just text but also random-access (possibly binary) data.

could be but that’s not my perception of a string.

That’s not to say that there aren’t a ton of algorithms where you can use Text like a String, treat ExtendedGraphemeCluster like a character, and get Unicode behavior without thinking about it.

But when it comes to random access and/or byte modification, you are better off working with something closer to a traditional (byte) string interface.

Trying to wedge random access and byte modification into the Swift String will simply complicate everything, slow down the algorithms which don’t need it, eat up more memory, as well as slow down bridging between Swift and Objective C code.

Yes, this has been extensively discussed in this thread...

Hence me suggesting earlier working with Data, [UInt8], or [Character] within the context of your manipulation code, then converting to a Swift String at the end. Convert to the data format you need, then convert back.

That’s exactly what I did, except that I have the desire to work exclusively with discrete
(in the model of humanly visible discrete elements on a graphical medium) ...

For the sake of completeness, here is my complete Swift 3.x playground example; it may be useful for others too:
//: Playground - noun: a place with Character!

import UIKit
import Foundation

struct TGString: CustomStringConvertible
{
    var ar = [Character]()
    
    var description: String // used by "print" and "\(...)"
    {
        return String(ar)
    }
    
    // Construct from a String
    init(_ str : String)
    {
        ar = Array(str.characters)
    }
    // Construct from a Character array
    init(_ tgs : [Character])
    {
        ar = tgs
    }
    // Construct from anything. well sort of..
    init(_ whatever : Any)
    {
        ar = Array("\(whatever)".characters)
    }
    
    var $: String
        {
        get // return as a normal Swift String
        {
            return String(ar)
        }
        set (str) //Mutable: set from a Swift String
        {
            ar = Array(str.characters)
        }
    }
    
    var asString: String
        {
        get // return as a normal Swift String
        {
            return String(ar)
        }
        set (str) //Mutable: set from a Swift String
        {
            ar = Array(str.characters)
        }
    }
    
    // Return the count of total number of characters:
    var count: Int
        {
        get
        {
            return ar.count
        }
    }
    
    // Return empty status:
    
    var isEmpty: Bool
        {
        get
        {
            return ar.isEmpty
        }
    }
    
    // s[n]: get or replace the single character at index n
    subscript (n: Int) -> TGString
        {
        get
        {
            return TGString( [ar[n]] )
        }
        set(newValue)
        {
            if newValue.isEmpty
            {
                ar.remove(at: n) // remove element when empty
            }
            else
            {
                ar[n] = newValue.ar[0]
                if newValue.count > 1
                {
                    insert(at: n, string: newValue[1..<newValue.count])
                }
            }
        }
    }

    subscript (r: Range<Int>) -> TGString
        {
        get
        {
            return TGString( Array(ar[r]) )
        }
        set(newValue)
        {
            ar[r] = ArraySlice(newValue.ar)
        }
    }

    subscript (r: ClosedRange<Int>) -> TGString
    {
        get
        {
            return TGString( Array(ar[r]) )
        }
        set(newValue)
        {
            ar[r] = ArraySlice(newValue.ar)
        }
    }

    func right( _ len: Int) -> TGString
    {
        var l = len
        
        if l > count
        {
            l = count
        }
        return TGString(Array(ar[count - l..<count]))
    }
    
    func left(_ len: Int) -> TGString
    {
        var l = len
        
        if l > count
        {
            l = count
        }
        return TGString(Array(ar[0..<l]))
    }
    
    func mid(_ pos: Int, _ len: Int) -> TGString
    {
        if pos >= count
        {
            return TGString.empty()
        }
        
        var l = len
        
        if pos + l > count // clamp the length so the range stays inside the array
        {
            l = count - pos
        }
        return TGString(Array(ar[pos..<pos + l]))
    }
    
    func mid(_ pos: Int) -> TGString
    {
        if pos >= count
        {
            return TGString.empty()
        }
        
        return TGString(Array(ar[pos..<count]))
    }
    
    mutating func insert(at: Int, string: TGString)
    {
        ar.insert(contentsOf: string.ar, at: at)
    }

    // Concatenate
    static func + (left: TGString, right: TGString) -> TGString
    {
        return(TGString(left.ar + right.ar) )
    }
    
    // Return an empty TGString:
    static func empty() -> TGString
    {
        return TGString([Character]())
    }
} // end TGString

// trivial isn’t? but effective...
var strabc = "abcdefghjiklmnopqrstuvwxyz"
var strABC = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
var abc = TGString(strabc)
var ABC = TGString(strABC)

func test()
{
    // as in Basic: left$, mid$, right$
    print(abc.left(5))
    print(abc.mid(5,10))
    print(ABC.mid(5))
    print(ABC.right(5))
    // ranges and concatenation:
    print(abc[12..<23])
    print(abc.left(5) + ABC.mid(6,6) + abc[10...25])
    
    // eat anything:
    let d:Double = -3.14159
    print(TGString(d))

    let n:Int = 1234
    print(TGString(n))
    
    print(TGString(1234.56789))
    
    let str = abc[15..<17].asString // Copy to a normal Swift String
    print(str)
    
    let s = "\(abc[12..<20])" // interpolate to normal Swift String.
    print(s)
    
    abc[3..<5] = TGString("34") // if lengths don't match:
    abc[8...9] = ABC[24...25] // length of dest. string is altered.
    abc[12] = TGString("$$$$") // if src l > 1 will insert remainder after dest.12 here
    abc[14] = TGString("") // empty removes character at pos.
    print(abc)
    abc.insert(at: 3, string: ABC[0..<3])
    print(abc)
}

test()

outputs this:
abcde
fghjiklmno
FGHIJKLMNOPQRSTUVWXYZ
VWXYZ
mnopqrstuvw
abcdeGHIJKLklmnopqrstuvwxyz
-3.14159
1234
1234.56789
pq
mnopqrst
abc34fghYZkl$$$nopqrstuvwxyz
abcABC34fghYZkl$$$nopqrstuvwxyz

That’s not to say that there aren’t features which would simplify/clarify algorithms working in this manner.

true.
This discussion was interesting and triggers further thinking,
maybe even more so because it touched on more fundamental considerations.

As you know, of course, a programming language is always a compromise between human
and computer, “The Machine” so to speak. It started years ago with writing
Assembler; then came higher-level PLs like Fortran, PL/1, Cobol etc., later C and C++,
to name just a few.
Also we see deviations in directions like OOP, FP...
(and everybody thinks they’re right of course, even me :o)

What (even in this time (2017)) often seems to be an unavoidable obstacle
is the tradeoff/compromise between speed and distance from the machine, that is,
how far optimisation aspects are emerging/surfacing through all these
layers of abstraction into the upper levels of the programming language...

In this view, the essence of this discussion was perhaps not the triviality of
whether or not one should instantiate a character array, but rather that
obviously (not only) in Swift these underlying optimisation aspects more or
less form an undesired restriction?

TedvG.
1980 - from Yes song: "Machine Messiah" - read the lyrics also: very much in context here!

···

On 25 Feb 2017, at 07:26, David Waite <david@alkaline-solutions.com> wrote:

-DW

On Feb 24, 2017, at 4:27 PM, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

ok, I understand, thank you
TedvG

On 25 Feb 2017, at 00:25, David Sweeris <davesweeris@mac.com <mailto:davesweeris@mac.com>> wrote:

On Feb 24, 2017, at 13:41, Ted F.A. van Gaalen <tedvgiosdev@gmail.com <mailto:tedvgiosdev@gmail.com>> wrote:

Hi David & Dave

can you explain that in more detail?

Wouldn’t that turn simple character access into a mutating function?

assigning like s[11...14] = str is of course, yes.
only then - that is if the character array thus has been changed -
it has to update the string in storage, yes.

but str = s[n..<m] doesn’t mutate.
so you’d have to keep a (private) isChanged: Bool or bit,
or a checksum over the character array?

It mutates because the String has to instantiate the Array<Character> to which you're indexing into, if it doesn't already exist. It may not make any externally visible changes, but it's still a change.

- Dave Sweeris


Ted,

It might have helped if instead of being called String and Character, they were named Text

I would oppose taking a good name like “Text” and using it for Strings which are mostly for machine processing purposes, but can be human-presentable with explicit locale. A name like Text would be a better fit for Strings bundled with locale etc. for the purpose of presentation to humans, which must always be in the context of some locale (even if a “default” system locale). Refer to the sections in the String manifesto[1][2]. Such a Text type is definitely out-of-scope for current discussion.

[1] https://github.com/apple/swift/blob/master/docs/StringManifesto.md#the-default-behavior-of-string
[2] https://github.com/apple/swift/blob/master/docs/StringManifesto.md#future-directions

and ExtendedGraphemeCluster.

What is expressed by Swift’s Character type is what the Unicode standard often refers to as a “user-perceived character”. Note that “character” by itself is not meaningful in Unicode (though it is often thrown about casually). In Swift, Character is an appropriate name here for the concept of a user-perceived character. If you want bytes, then you can use UInt8. If you want Unicode scalar values, you can use UnicodeScalar. If you want code units, you can use whatever that ends up looking like (probably an associated type named CodeUnit that is bound to UInt8 or UInt16 depending on the encoding).
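To make those levels concrete, here is a small sketch using the views that exist on String in current Swift (the sample string is arbitrary, and the counts assume a Swift 5 toolchain with up-to-date grapheme breaking):

```swift
// "Cafe" + combining acute accent, a space, and a flag made of two regional indicators.
let s = "Cafe\u{301} \u{1F1F3}\u{1F1F1}"

print(s.count)                 // 6  Characters (user-perceived characters)
print(s.unicodeScalars.count)  // 8  Unicode scalar values
print(s.utf16.count)           // 10 UTF-16 code units
print(s.utf8.count)            // 15 UTF-8 code units (bytes)

// Each view hands out a different element type.
let firstByte: UInt8 = s.utf8.first!
let firstScalar: Unicode.Scalar = s.unicodeScalars.first!
let firstCharacter: Character = s.first!
print(firstByte, firstScalar, firstCharacter)
```

The same text has four different lengths depending on which unit is counted, which is why "the nth character" has no single cheap answer.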

···

On Feb 25, 2017, at 3:26 PM, David Waite via swift-evolution <swift-evolution@swift.org> wrote:

They don’t really have the same behavior or functionality as strings/characters in many other languages, especially older languages. This is because in many languages, strings are not just text but also random-access (possibly binary) data.

That’s not to say that there aren’t a ton of algorithms where you can use Text like a String, treat ExtendedGraphemeCluster like a character, and get Unicode behavior without thinking about it.

But when it comes to random access and/or byte modification, you are better off working with something closer to a traditional (byte) string interface.

Trying to wedge random access and byte modification into the Swift String will simply complicate everything, slow down the algorithms which don’t need it, eat up more memory, as well as slow down bridging between Swift and Objective C code.

Hence me suggesting earlier working with Data, [UInt8], or [Character] within the context of your manipulation code, then converting to a Swift String at the end. Convert to the data format you need, then convert back.
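A short sketch of that convert, manipulate, convert-back pattern, written for a recent Swift toolchain (in Swift 3 the first conversion would go through `.characters`). The sample text and the edits are arbitrary:

```swift
let original = "hello, world"

// Route 1: [Character] gives O(1) integer subscripting on user-perceived characters.
var characters = Array(original)      // one O(n) conversion
characters[0] = "H"
characters[7] = "W"
let viaCharacters = String(characters)

// Route 2: [UInt8] for byte-level work. This edit is only safe because it
// replaces one ASCII byte with another, so the UTF-8 stays valid.
var bytes = Array(original.utf8)
bytes[0] = UInt8(ascii: "H")
let viaBytes = String(decoding: bytes, as: UTF8.self)

print(viaCharacters)   // Hello, World
print(viaBytes)        // Hello, world
```

Both conversions are linear in the length of the string, so the cost is paid once per round trip rather than on every access.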

That’s not to say that there aren’t features which would simplify/clarify algorithms working in this manner.

-DW

On Feb 24, 2017, at 4:27 PM, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

ok, I understand, thank you
TedvG

On 25 Feb 2017, at 00:25, David Sweeris <davesweeris@mac.com <mailto:davesweeris@mac.com>> wrote:

On Feb 24, 2017, at 13:41, Ted F.A. van Gaalen <tedvgiosdev@gmail.com <mailto:tedvgiosdev@gmail.com>> wrote:

Hi David & Dave

can you explain that in more detail?

Wouldn’t that turn simple character access into a mutating function?

assigning like s[11...14] = str is of course, yes.
only then - that is if the character array thus has been changed -
it has to update the string in storage, yes.

but str = s[n..<m] doesn’t mutate.
so you’d have to keep a (private) isChanged: Bool or bit,
or a checksum over the character array?

It mutates because the String has to instantiate the Array<Character> to which you're indexing into, if it doesn't already exist. It may not make any externally visible changes, but it's still a change.

- Dave Sweeris


Equally a non-starter. All known threadsafe schemes that require caches to be updated upon non-mutating operations have horrible performance issues, and further this would penalize all string code by reserving space for the cache and filling it even for the vast majority of operations that don't require random access.

Well, maybe “caching” is not the right description for what I've suggested.
It is more like:
  let all strings be stored as they are now, but as soon as you want to work with
random accessing parts of a string, just “lift the string out of normal optimised string storage”
and then add (temporarily) a Character array so one can work with this array directly,
which implies that all other strings remain as they are. Ergo: efficiency
is only reduced for the “elevated” strings.
Using e.g. str.freeSpace(), if necessary, would then place the String back
in its normal storage domain, thereby disposing of the Character array
associated with it.
   

Trust me, we've gotten lots of such suggestions and thought through the implications of each one very carefully.

That’s good, because it means that a lot of people are interested in this subject and wish to help.
Of course you’ll get many suggestions that might not be very useful,
perhaps like this one... but sometimes suddenly someone
comes along with things that might never have occurred to you.
That is the beautiful nature of ideas.


I'm afraid you will have to accept being disappointed about this.

Well, like most developers, I am a stubborn kind of guy..
Luckily Swift is very flexible like Lego, so I rolled my own convenience struct.
If I need direct access on a string I simply copy the string to it.
it permits things like this: (and growing)

let strabc = "abcdefghjiklmnopqrstuvwxyz"
let strABC = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
var abc = TGString(strabc)
var ABC = TGString(strABC)

func test()
{
    // as in Basic: left$, mid$, right$
    print(abc.left(5))
    print(abc.mid(5,10))
    print(ABC.mid(5))
    print(ABC.right(5))

    // ranges and concatenation:
    print(abc[12..<23])
    print(abc.left(5) + ABC.mid(6,6) + abc[10...25])
    
    // eat anything:
    let d:Double = -3.14159
    print(TGString(d))

    let n:Int = 1234
    print(TGString(n))
    
    print(TGString(1234.56789))
    
    let str = abc[15..<17].asString // Copy to a normal Swift String
    print(str)
    
    let s = "\(abc[12..<20])" // interpolate to normal Swift String.
    print(s)
    
    abc[3..<5] = TGString("34") // if lengths don't match:
    abc[8...9] = ABC[24...25] // length of dest. string is altered.
    abc[12] = TGString("$$$$") // if src l > 1 will insert remainder after dest.12 here
    abc[14] = TGString("") // empty removes character at pos.
    print(abc)
    abc.insert(at: 3, string: ABC[0..<3])
    print(abc)
}

test()
.
outputs:
abcde
fghjiklmno
FGHIJKLMNOPQRSTUVWXYZ
VWXYZ
mnopqrstuvw
abcdeGHIJKLklmnopqrstuvwxyz
-3.14159
1234
1234.56789
abcdefghjiklmnopqrstuvwxyz
mnopqrst
abc34fghYZkl$$$$nopqrstuvwxyz
abcABC34fghYZkl$$$$nopqrstuvwxyz

Kinda hoped that this could be built into Swift strings.
Anyway, I’ve made myself what I wanted, which happily co-exists
alongside normal Swift strings. Performance and storage
aspects of my struct TGString are not very important, because
I am not using this on thousands of strings.
I simply want to use a string as a plain array, that’s all,
which is implemented in almost every PL on this planet.

More generally, there's a reason that the collection model has bidirectional and random access distinctions: important data structures are inherently not random access.

I don’t understand the above line: definition of “important data structures” <> “inherently”

Heroic attempts to present the illusion that they are randomly-accessible are not going to fly.

  ?? Accessing discrete elements directly in an array is not an illusion to me.
(e.g. I took the 4th and 7th eggs from the container)

These abstractions always break down, leaking the true non-random-access nature in often unpredictable ways, penalizing lots of code for the sake of a very few use-cases, and introducing complexity that is hard for the optimizer to digest and makes it painful (sometimes impossible) to grow and evolve the library.

Is an Array an abstraction? of what? I don’t get this either. most components in the real world can be accessed randomly.

This should be seen as a general design philosophy: Swift presents abstractions that harmonize with, rather than hide, the true nature of things.

The true nature of things is a very vague and subjective criterion; how can you harmonise with that, let alone with abstractions?
E.g. for me, “the true nature of things” for an array is that it has directly accessible discrete elements.


Sorry, with respect, we have a difference of opinion here.

Thanks btw for the link to this article about tagged pointers, very interesting.
it inspired me to (have) read other things in this domain as well.

TedvG

···

On 23 Feb 2017, at 02:24, Dave Abrahams <dabrahams@apple.com> wrote:

Ted,

It might have helped if instead of being called String and Character, they were named Text

I would oppose taking a good name like “Text” and using it for Strings which are mostly for machine processing purposes, but can be human-presentable with explicit locale. A name like Text would be a better fit for Strings bundled with locale etc. for the purpose of presentation to humans, which must always be in the context of some locale (even if a “default” system locale). Refer to the sections in the String manifesto[1][2]. Such a Text type is definitely out-of-scope for current discussion.

Oh, I would never propose such a naming change, because I am comfortable with the existing names. I’m just acknowledging that the history of string manipulation causes friction in developers coming from other languages, in that they may expect certain functionality which doesn’t make sense within String’s goals.

I was merely illustrating that there is a big difference between how strings work in traditional languages and how truly Unicode-safe strings work. In scripting languages like Ruby and Python, the string type bears the brunt of binary data handling. Even in languages like Java and C#, Unicode support makes compromises that Swift seems unwilling to make.

IMO, that Swift String doesn’t have random access capabilities is not a deficiency in Swift, but can cause misunderstandings of how Swift strings differ from other languages.

and ExtendedGraphemeCluster.

What is expressed by Swift’s Character type is what the Unicode standard often refers to as a “user-perceived character”. Note that “character” by itself is not meaningful in Unicode (though it is often thrown about casually). In Swift, Character is an appropriate name here for the concept of a user-perceived character. If you want bytes, then you can use UInt8. If you want Unicode scalar values, you can use UnicodeScalar. If you want code units, you can use whatever that ends up looking like (probably an associated type named CodeUnit that is bound to UInt8 or UInt16 depending on the encoding).

A character “char” in C or C++ is considered nearly universally to be an 8-bit value. A Character in Java or Char in C# is a 16-bit (UTF-16) value. All of these effectively behave as integer values (with Character in Java having the unique quality of being unsigned).

IMO, that Swift Character doesn’t behave as an integer value but rather closer to a string holding one user-perceived character is not a deficiency in Swift, but can cause misunderstandings because of how Swift differs from other languages.
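A brief sketch of that difference, assuming Swift 5 (where `Character.asciiValue` and `Character.unicodeScalars` are available); the values are arbitrary examples:

```swift
let letter: Character = "A"

// let next = letter + 1   // would not compile: Character has no integer arithmetic

// To do arithmetic, step down to a representation that is numeric.
if let ascii = letter.asciiValue {                    // UInt8?, nil for non-ASCII characters
    let next = Character(UnicodeScalar(ascii + 1))
    print(next)                                       // B
}

// A single Character can be built from several Unicode scalars,
// which is why it cannot simply be a fixed-width integer.
let accented: Character = "e\u{301}"
print(accented.unicodeScalars.count)                  // 2
```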

-DW

···

On Feb 25, 2017, at 2:54 PM, Michael Ilseman <milseman@apple.com> wrote:

On Feb 25, 2017, at 3:26 PM, David Waite via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

Equally a non-starter. All known threadsafe schemes that require caches to be updated upon non-mutating operations have horrible performance issues, and further this would penalize all string code by reserving space for the cache and filling it even for the vast majority of operations that don't require random access.

Well, maybe “caching” is not the right description for what I've suggested.
It is more like:
  let all strings be stored as they are now, but as soon as you want to work with
random accessing parts of a string just “lift the string out of normal optimised string storage”
and then add (temporarily) a Character array so one can work with this array directly ”

That's a cache.

which implies that all other strings remain as they are. ergo: efficiency
is only reduced for the “elevated” strings,

You have to add that temporary array somewhere. The performance of every string is penalized for that storage, and also for the cost of throwing it out upon mutation. Every branch counts.
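The storage half of that argument can be seen with two toy layouts. Both structs below are hypothetical and are not Swift's actual String representation; the sketch only shows that an optional cache field takes space in every value, whether or not it is ever filled. Exact sizes depend on the platform, so none are claimed here.

```swift
// Hypothetical layouts, for illustration only.
struct PlainText {
    var storage: String
}

struct TextWithIndexCache {
    var storage: String
    var characters: [Character]?   // stays nil for strings that never use integer indexing
}

// The cache field is paid for by every instance, used or not.
print(MemoryLayout<PlainText>.size)
print(MemoryLayout<TextWithIndexCache>.size)
```

On top of the extra bytes, every mutation would need a check for whether a cache exists and must be discarded, which is the branch cost mentioned above.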

Using e.g. str.freeSpace(), if necessary, would then place the String back
in its normal storage domain, thereby disposing the Character array
associated with it.

Avoiding hidden dynamic storage overhead that needs to be freed is an explicit goal of the design (see the section on String and Substring).
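For readers who have not seen the String and Substring split in practice, a minimal sketch against a recent Swift toolchain follows. `splitKeyValue` is a made-up helper, not a library function:

```swift
// Hypothetical helper: split "key=value" at the first "=".
func splitKeyValue(_ line: String) -> (key: String, value: String)? {
    guard let equals = line.firstIndex(of: "=") else { return nil }

    // Slices are Substrings: views into `line`'s storage, no copy yet.
    let key: Substring = line[..<equals]
    let value: Substring = line[line.index(after: equals)...]

    // Converting to String copies the characters, so the results
    // no longer keep the whole original line alive.
    return (String(key), String(value))
}

if let pair = splitKeyValue("name=Ted") {
    print(pair.key, pair.value)   // name Ted
}
```

The copy happens at a point the programmer writes explicitly, rather than being held and freed behind the scenes.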

Trust me, we've gotten lots of such suggestions and thought through the implications of each one very carefully.

That’s good, because it means that a lot of people are interested in this subject and wish to help.
Of course you’ll get many suggestions that might not be very useful,
perhaps like this one... but sometimes suddenly someone
comes along with things that might never have occurred to you.
That is the beautiful nature of ideas.


But at some point, I hope you'll understand, I also have to say that I think all the simple schemes have been adequately explored and the complex ones all seem to have this basic property of relying on caches, which has unacceptable performance, complexity, and, yes, usability costs. Analyzing and refuting each one in detail begins to be a waste of time after that. I'm not really willing to go further down this road unless someone has an implementation and experimental evidence that demonstrates it as non-problematic.

I'm afraid you will have to accept being disappointed about this.

Well, like most developers, I am a stubborn kind of guy..
Luckily Swift is very flexible like Lego, so I rolled my own convenience struct.
If I need direct access on a string I simply copy the string to it.
it permits things like this: (and growing)

let strabc = "abcdefghjiklmnopqrstuvwxyz"
let strABC = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
var abc = TGString(strabc)
var ABC = TGString(strABC)

func test()
{
    // as in Basic: left$, mid$, right$
    print(abc.left(5))
    print(abc.mid(5,10))
    print(ABC.mid(5))
    print(ABC.right(5))

    // ranges and concatenation:
    print(abc[12..<23])
    print(abc.left(5) + ABC.mid(6,6) + abc[10...25])
    
    // eat anything:
    let d:Double = -3.14159
    print(TGString(d))

    let n:Int = 1234
    print(TGString(n))
    
    print(TGString(1234.56789))
    
    let str = abc[15..<17].asString // Copy to a normal Swift String
    print(str)
    
    let s = "\(abc[12..<20])" // interpolate to normal Swift String.
    print(s)
    
    abc[3..<5] = TGString("34") // if lengths don't match:
    abc[8...9] = ABC[24...25] // length of dest. string is altered.
    abc[12] = TGString("$$$$") // if src l > 1 will insert remainder after dest.12 here
    abc[14] = TGString("") // empty removes character at pos.
    print(abc)
    abc.insert(at: 3, string: ABC[0..<3])
    print(abc)
}

test()
.
outputs:
abcde
fghjiklmno
FGHIJKLMNOPQRSTUVWXYZ
VWXYZ
mnopqrstuvw
abcdeGHIJKLklmnopqrstuvwxyz
-3.14159
1234
1234.56789
abcdefghjiklmnopqrstuvwxyz
mnopqrst
abc34fghYZkl$$$$nopqrstuvwxyz
abcABC34fghYZkl$$$$nopqrstuvwxyz

Kinda hoped that this could be built into Swift strings.
Anyway, I’ve made myself what I wanted, which happily co-exists
alongside normal Swift strings. Performance and storage
aspects of my struct TGString are not very important, because
I am not using this on thousands of strings.
I simply want to use a string as a plain array, that’s all,
which is implemented in almost every PL on this planet.

More generally, there's a reason that the collection model has bidirectional and random access distinctions: important data structures are inherently not random access.

I don’t understand the above line: definition of “important data structures” <> “inherently”

Important data structures are those from the classical CS literature upon which every practical programming language (and even modern CPU hardware) is based, e.g. hash tables. Based on the properties of modern string processing, strings fall into the same category. "Inherent" means that performance characteristics are tied to the structure of the data or problem being solved. You can't sort by comparison in better than O(N log N) worst case (mythical quantum computers don't count here), and that's been proven mathematically. Similarly it's easy to prove that the constraints of our problem mean that counting the characters in a string will always be O(N) worst case where N is the length of the representation. That means strings are inherently not random access.
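A small sketch of why the count cannot be read off the size of the representation (sample strings chosen for illustration; counts assume a Swift 5 toolchain). All three strings occupy four bytes of UTF-8, yet they hold different numbers of Characters, so the only way to find the nth one is to decode from the start:

```swift
// Three strings, each exactly 4 bytes long in UTF-8.
let samples = ["abcd", "\u{E9}\u{E9}", "\u{1F600}"]

for sample in samples {
    // Byte length is stored; character count requires walking the contents.
    print(sample.utf8.count, sample.count)
}
// prints:
// 4 4
// 4 2
// 4 1
```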

Heroic attempts to present the illusion that they are randomly-accessible are not going to fly.

  ?? Accessing discrete elements directly

All collections have direct access via indices. You mean randomly, via arbitrary integers.

in an array is not an illusion to me.
(e.g. I took the 4th and 7th eggs from the container)

It's not an illusion when they're stored in an array.

If you have to walk down an aisle of differently sized cereal boxes to pick the 5th box of SuperBoomCrisp Flakes off the shelf in the grocery store, that's not random access (even if you're willing to drop the boxes into an array for later lookups as you go, as you're proposing). That's what Strings are like.
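In code, the aisle looks roughly like this (a sketch for a recent Swift toolchain; the string is an arbitrary mix of differently sized "boxes"):

```swift
// One-byte letters, a flag built from two scalars, and an "e" with a combining accent.
let shelf = "a\u{1F1F3}\u{1F1F1}ce\u{301}!"

// Reaching the nth Character means stepping over everything before it.
let third = shelf.index(shelf.startIndex, offsetBy: 2)   // O(n) walk from the start
print(shelf[third])                                      // c

// An index that was already found is direct access: no second walk from the start.
if let bang = shelf.firstIndex(of: "!") {
    let before = shelf.index(before: bang)
    print(shelf[before])                                 // é
}
```

Keeping the `String.Index` values found during one pass is the supported way to get back to those positions cheaply, without building a parallel array.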

These abstractions always break down, leaking the true non-random-access nature in often unpredictable ways, penalizing lots of code for the sake of a very few use-cases, and introducing complexity that is hard for the optimizer to digest and makes it painful (sometimes impossible) to grow and evolve the library.

Is an Array an abstraction? of what?

A randomly-accessible homogeneous tail growable collection. But the abstraction in question here isn't Array; it's RandomAccessCollection.

I don’t get this either. most components in the real world can be accessed randomly.

Actually that's far from being true in the real world. See the grocery store above. Computer memory is very unlike most things in the real world. Ask any roboticist.

This should be seen as a general design philosophy: Swift presents abstractions that harmonize with, rather than hide, the true nature of things.

The true nature of things is a very vague and subjective criterion,

Not at all; see my explanation above.

how can you harmonise with that, let alone with abstractions?
e.g. for me: “the true nature of things” for an array is that it has directly accessible discrete elements


Arrays definitely support random access. Strings are not arrays.

Sorry, with respect, we have a difference of opinion here.

That's fine. Please don't be offended that I don't wish to argue it further. It's been an interesting exercise while I'm on vacation and I hoped it would lay out some general principles that would be useful to others in future even if you are not convinced, but when I get back to work next week I'll have to focus on other things.

···

Sent from my moss-covered three-handled family gradunza

On Feb 23, 2017, at 2:04 PM, Ted F.A. van Gaalen <tedvgiosdev@gmail.com> wrote:

On 23 Feb 2017, at 02:24, Dave Abrahams <dabrahams@apple.com> wrote:

Thanks btw for the link to this article about tagged pointers, very interesting.
it inspired me to (have) read other things in this domain as well.

TedvG

Wouldn’t that turn simple character access into a mutating function?

- Dave Sweeris

···

On Feb 23, 2017, at 4:04 PM, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org> wrote:

On 23 Feb 2017, at 02:24, Dave Abrahams <dabrahams@apple.com <mailto:dabrahams@apple.com>> wrote:

Equally a non-starter. All known threadsafe schemes that require caches to be updated upon non-mutating operations have horrible performance issues, and further this would penalize all string code by reserving space for the cache and filling it even for the vast majority of operations that don't require random access.

Well, maybe “caching” is not the right description for what I've suggested.
It is more like:
  let all strings be stored as they are now, but as soon as you want to work with
random accessing parts of a string just “lift the string out of normal optimised string storage”
and then add (temporarily) a Character array so one can work with this array directly ”
which implies that all other strings remain as they are. ergo: efficiency
is only reduced for the “elevated” strings,
Using e.g. str.freeSpace(), if necessary, would then place the String back
in its normal storage domain, thereby disposing the Character array
associated with it.

Exactly.

···

Sent from my moss-covered three-handled family gradunza

On Feb 24, 2017, at 9:49 AM, David Sweeris <davesweeris@mac.com> wrote:

Wouldn’t that turn simple character access into a mutating function?

- Dave Sweeris

Hi Dave
Thanks for taking the time to go into this and explain.
This optimising goes much further than I thought.

That's fine. Please don't be offended that I don't wish to argue it further. It's been an interesting exercise while I'm on vacation and I hoped it would lay out some general principles that would be useful to others in future even if you are not convinced, but when I get back to work next week I'll have to focus on other things.

Yes, I understand; it would become too iterative and time-consuming, I guess.
(How can you become work-detached if you keep doing things like this during vacation?)

Enjoy your vacation!
TedvG

···

On 24 Feb 2017, at 22:40, Dave Abrahams <dabrahams@apple.com> wrote:

Sent from my moss-covered three-handled family gradunza

On Feb 23, 2017, at 2:04 PM, Ted F.A. van Gaalen <tedvgiosdev@gmail.com> wrote:

On 23 Feb 2017, at 02:24, Dave Abrahams <dabrahams@apple.com> wrote:

Equally a non-starter. All known threadsafe schemes that require caches to be updated upon non-mutating operations have horrible performance issues, and further this would penalize all string code by reserving space for the cache and filling it even for the vast majority of operations that don't require random access.

Well, maybe “caching” is not the right description for what I've suggested.
It is more like this:
let all strings be stored as they are now, but as soon as you want random access to
parts of a string, just “lift the string out of normal optimised string storage”
and (temporarily) add a Character array, so one can work with this array directly.

That's a cache.

This implies that all other strings remain as they are; ergo, efficiency
is only reduced for the “elevated” strings.

You have to add that temporary array somewhere. The performance of every string is penalized for that storage, and also for the cost of throwing it out upon mutation. Every branch counts.
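
To make the cost concrete, here is a minimal sketch of the "lift into a Character array" idea. `CachedString` and its members are hypothetical names invented for this illustration; this is not how Swift's String is implemented. It shows the two points raised in the thread: the cache slot is paid for by every value, and filling it on first access turns a plain read into a mutating operation.

```swift
// Hypothetical type for illustration only; not Swift's actual String layout.
struct CachedString {
    private var storage: String
    // The "temporary" Character array. Every value carries this slot,
    // whether or not random access is ever requested.
    private var cache: [Character]? = nil

    init(_ s: String) { storage = s }

    // Filling the cache changes self, so even a read has to be `mutating`
    // (the alternative is reference semantics plus synchronization).
    mutating func character(at offset: Int) -> Character {
        if cache == nil {
            cache = Array(storage)   // O(N) walk plus a heap allocation
        }
        return cache![offset]
    }

    // Every mutation has to remember to throw the cache away.
    mutating func append(_ other: String) {
        storage += other
        cache = nil
    }
}

var s = CachedString("héllo")
let second = s.character(at: 1)      // needs `var s`, not `let s`
s.append("!")
let last = s.character(at: 5)

print(second, last)
print(MemoryLayout<String>.size, MemoryLayout<CachedString>.size)
```

The last line prints the in-line size of both types; the wrapper is larger even for values that never use the cache.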

Calling e.g. str.freeSpace(), if necessary, would then place the String back
in its normal storage domain, thereby disposing of the Character array
associated with it.

Avoiding hidden dynamic storage overhead that needs to be freed is an explicit goal of the design (see the section on String and Substring).

Trust me, we've gotten lots of such suggestions and thought through the implications of each one very carefully.

That’s good, because it means that a lot of people are interested in this subject and wish to help.
Of course you’ll get many suggestions that might not be very useful,
perhaps like this one... but sometimes someone suddenly
comes along with things that might never have occurred to you.
That is the beautiful nature of ideas.


But at some point, I hope you'll understand, I also have to say that I think all the simple schemes have been adequately explored and the complex ones all seem to have this basic property of relying on caches, which has unacceptable performance, complexity, and, yes, usability costs. Analyzing and refuting each one in detail begins to be a waste of time after that. I'm not really willing to go further down this road unless someone has an implementation and experimental evidence that demonstrates it as non-problematic.

I'm afraid you will have to accept being disappointed about this.

Well, like most developers, I am a stubborn kind of guy...
Luckily Swift is very flexible, like Lego, so I rolled my own convenience struct.
If I need direct access to a string, I simply copy the string into it.
It permits things like this (and growing):

let strabc = "abcdefghjiklmnopqrstuvwxyz"
let strABC = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
var abc = TGString(strabc)
var ABC = TGString(strABC)

func test()
{
    // as in Basic: left$, mid$, right$
    print(abc.left(5))
    print(abc.mid(5,10))
    print(ABC.mid(5))
    print(ABC.right(5))

    // ranges and concatenation:
    print(abc[12..<23])
    print(abc.left(5) + ABC.mid(6,6) + abc[10...25])
    
    // eat anything:
    let d:Double = -3.14159
    print(TGString(d))

    let n:Int = 1234
    print(TGString(n))
    
    print(TGString(1234.56789))
    
    let str = abc[15..<17].asString // Copy to a normal Swift String
    print(str)
    
    let s = "\(abc[12..<20])" // interpolate to normal Swift String.
    print(s)
    
    abc[3..<5] = TGString("34") // if lengths don't match:
    abc[8...9] = ABC[24...25] // length of dest. string is altered.
    abc[12] = TGString("$$$$") // if src l > 1 will insert remainder after dest.12 here
    abc[14] = TGString("") // empty removes character at pos.
    print(abc)
    abc.insert(at: 3, string: ABC[0..<3])
    print(abc)
}

test()
outputs:
abcde
fghjiklmno
FGHIJKLMNOPQRSTUVWXYZ
VWXYZ
mnopqrstuvw
abcdeGHIJKLklmnopqrstuvwxyz
-3.14159
1234
1234.56789
abcdefghjiklmnopqrstuvwxyz
mnopqrst
abc34fghYZkl$$$$nopqrstuvwxyz
abcABC34fghYZkl$$$$nopqrstuvwxyz

I kind of hoped that this could be built into Swift strings.
Anyway, I’ve made myself what I wanted, which happily co-exists
alongside normal Swift strings. Performance and storage
aspects of my struct TGString are not very important, because
I am not using this on thousands of strings.
I simply want to use a string as a plain array, that’s all,
which is supported in almost every programming language on this planet.
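
The TGString source is not shown in this thread, so the following is only a guess at the general shape of such a wrapper. `CharArrayString`, its clamping behaviour, and every member name are assumptions made for illustration. The idea is simply to pay for one O(N) copy into `[Character]` and then use Array's integer subscripts.

```swift
// Hypothetical sketch: names and semantics are assumptions,
// not the actual TGString implementation.
struct CharArrayString: CustomStringConvertible {
    var chars: [Character]

    init(_ s: String) { chars = Array(s) }   // O(N) copy, done once
    init(_ slice: ArraySlice<Character>) { chars = Array(slice) }

    var description: String { String(chars) }

    // O(1) integer subscripts, because Array is random access.
    subscript(i: Int) -> Character { chars[i] }
    subscript(r: Range<Int>) -> CharArrayString { CharArrayString(chars[r]) }

    // BASIC-style helpers; out-of-range requests are clamped.
    func left(_ n: Int) -> CharArrayString {
        CharArrayString(chars.prefix(max(0, n)))
    }
    func right(_ n: Int) -> CharArrayString {
        CharArrayString(chars.suffix(max(0, n)))
    }
    func mid(_ start: Int, _ length: Int) -> CharArrayString {
        let lower = min(max(0, start), chars.count)
        let upper = min(lower + max(0, length), chars.count)
        return CharArrayString(chars[lower..<upper])
    }
}

let alphabet = CharArrayString("abcdefghijklmnopqrstuvwxyz")
print(alphabet.left(5))       // abcde
print(alphabet.mid(5, 10))    // fghijklmno
print(alphabet.right(5))      // vwxyz
print(alphabet[12..<23])      // mnopqrstuvw
```

A wrapper like this lives alongside String rather than inside it, so only the code that asks for integer indexing pays for the extra array.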

More generally, there's a reason that the collection model has bidirectional and random access distinctions: important data structures are inherently not random access.

I don’t understand the above line: definition of “important data structures” <> “inherently”

Important data structures are those from the classical CS literature upon which every practical programming language (and even modern CPU hardware) is based, e.g. hash tables. Based on the properties of modern string processing, strings fall into the same category. "Inherent" means that performance characteristics are tied to the structure of the data or problem being solved. You can't sort by comparison in better than O(N log N) worst case (mythical quantum computers don't count here), and that's been proven mathematically. Similarly it's easy to prove that the constraints of our problem mean that counting the characters in a string will always be O(N) worst case where N is the length of the representation. That means strings are inherently not random access.
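
A small sketch of what this means in practice, using only standard library calls. The helper `character(at:in:)` is a hypothetical name for illustration. Two strings with the same number of Characters can have different representation lengths, so the position of the n-th Character cannot be computed from n alone; it has to be found by walking.

```swift
let ascii = "cafe"
let accented = "cafe\u{301}"     // "café" written with a combining acute accent

// Same number of Characters, different representation lengths.
print(ascii.count, ascii.utf8.count)         // 4 4
print(accented.count, accented.utf8.count)   // 4 6

// String has no integer subscript. Reaching the n-th Character means
// walking grapheme boundaries from a known index, which is O(n).
func character(at offset: Int, in s: String) -> Character? {
    guard offset >= 0,
          let i = s.index(s.startIndex, offsetBy: offset, limitedBy: s.endIndex),
          i < s.endIndex
    else { return nil }
    return s[i]
}

let fourth = character(at: 3, in: accented)
print(fourth.map { String($0) } ?? "nil")
```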

Heroic attempts to present the illusion that they are randomly-accessible are not going to fly.

  ?? Accessing discrete elements directly

All collections have direct access via indices. You mean randomly, via arbitrary integers.

in an array is not an illusion to me.
(e.g. I took the 4th and 7th eggs from the container)

It's not an illusion when they're stored in an array.

If you have to walk down an aisle of differently sized cereal boxes to pick the 5th box of SuperBoomCrisp Flakes off the shelf in the grocery store, that's not random access (even if you're willing to drop the boxes into an array for later lookups as you go, as you're proposing). That's what Strings are like.

These abstractions always break down, leaking the true non-random-access nature in often unpredictable ways, penalizing lots of code for the sake of a very few use-cases, and introducing complexity that is hard for the optimizer to digest and makes it painful (sometimes impossible) to grow and evolve the library.

Is an Array an abstraction? of what?

A randomly-accessible, homogeneous, tail-growable collection. But the abstraction in question here isn't Array; it's RandomAccessCollection.
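
As a rough illustration of why the protocol matters more than the concrete type: generic code states the access guarantee it needs in its signature. The function names `middle(of:)` and `lastTwo(of:)` below are made up for this sketch.

```swift
// Generic code can require O(1) offset arithmetic in its signature...
func middle<C: RandomAccessCollection>(of c: C) -> C.Element? {
    guard !c.isEmpty else { return nil }
    // For a RandomAccessCollection, index(_:offsetBy:) is documented as O(1).
    return c[c.index(c.startIndex, offsetBy: c.count / 2)]
}

// ...or accept any BidirectionalCollection and walk from the end.
func lastTwo<C: BidirectionalCollection>(of c: C) -> C.SubSequence {
    c.suffix(2)
}

let chars = Array("naïve")

if let m = middle(of: chars) { print(m) }   // Array is random access: accepted
print(lastTwo(of: "naïve"))                 // String is bidirectional: accepted
// middle(of: "naïve")   // does not compile: String is not a RandomAccessCollection
```

If String claimed to be a RandomAccessCollection, algorithms written against that protocol would silently lose their documented complexity when handed a String.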

I don’t get this either. most components in the real world can be accessed randomly.

Actually that's far from being true in the real world. See the grocery store above. Computer memory is very unlike most things in the real world. Ask any roboticist.

This should be seen as a general design philosophy: Swift presents abstractions that harmonize with, rather than hide, the true nature of things.

The true nature of things is a very vague and subjective criterion,

Not at all; see my explanation above.

how can you harmonise with that, let alone with abstractions?
e.g. for me: “the true nature of things” for an array is that it has directly accessible discrete elements


Arrays definitely support random access. Strings are not arrays.

Sorry, with respect, we have a difference of opinion here.

That's fine. Please don't be offended that I don't wish to argue it further. It's been an interesting exercise while I'm on vacation and I hoped it would lay out some general principles that would be useful to others in future even if you are not convinced, but when I get back to work next week I'll have to focus on other things.

Thanks btw for the link to this article about tagged pointers, very interesting.
It inspired me to read other things in this domain as well.

TedvG

Hi David & Dave

can you explain that in more detail?

Wouldn’t that turn simple character access into a mutating function?

Assigning, as in s[11...14] = str, is of course mutating, yes.
Only then, that is, if the character array has thus been changed,
does it have to update the string in storage, yes.

But str = s[n..<m] doesn’t mutate.
So you’d have to keep a (private) isChanged: Bool or bit,
or a checksum over the character array?

Kind Regards
TedvG

···

Newbie question (sorry!):

Is it true or false that any grapheme cluster can be translated into a composite character? I understand this is certainly true at least for surrogate pairs, which can be translated into one code point (i.e. one UTF32 character).

If that is true, we could transparently convert any text file (whether UTF32, UTF16, UTF8 or UTF7, LOW or HIGH) into a series of UTF32 composite characters: we would then gain random access to characters in strings (essential for NLP parsers), as well as reversibility data <=> string <=> array of characters.

I'm almost sure it can't be true, else Unicode variation selectors would be pointless.

Moreover, using UTF-32 for memory representation is a major waste of space, especially for parsers that need to be able to stream the content.

To get random access, you would have to keep the whole converted array in memory.

It is false.

:woman_zombie: is:

  • 1 grapheme (Swift's Character)
  • 4 scalars (Swift's Unicode.Scalar):
    • U+1F9DF, U+200D, U+2640, U+FE0F
  • 5 UTF-16 code units (Swift's String.UTF16View.Element):
    • D83E DDDF 200D 2640 FE0F
  • 13 UTF-8 code units (Swift's String.UTF8View.Element):
    • F0 9F A7 9F E2 80 8D E2 99 80 EF B8 8F

edit: Bah, Discourse doesn't have support for the latest emoji! This cannot stand! For reference, here is the emoji: 🧟‍♀️ Woman Zombie Emoji
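
The breakdown can be checked directly in Swift. This sketch builds the emoji from its four scalars with escapes, so it does not depend on how the page renders it; the helper `hex` is a made-up name. The Character count assumes a toolchain whose grapheme breaking knows about emoji ZWJ sequences, which recent Swift releases do.

```swift
// Woman zombie: U+1F9DF, zero-width joiner, U+2640 (female sign),
// U+FE0F (variation selector-16).
let zombie = "\u{1F9DF}\u{200D}\u{2640}\u{FE0F}"

// Formats a list of code units or scalar values as upper-case hex.
func hex<T: BinaryInteger>(_ units: [T]) -> String {
    units.map { String($0, radix: 16, uppercase: true) }.joined(separator: " ")
}

print(zombie.count)                                  // number of Characters
print(hex(zombie.unicodeScalars.map { $0.value }))   // Unicode scalars
print(hex(Array(zombie.utf16)))                      // UTF-16 code units
print(hex(Array(zombie.utf8)))                       // UTF-8 code units
```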

I am being stubborn...

Jean-Daniel: I need to parse texts in over 30 languages, including Arabic, Chinese, etc. My parser needs constant-time access to any character, even if this character has one, two or more diacritics (for instance, in Arabic and Hebrew texts, vowels are almost always absent). On the other hand, the texts to parse are seldom larger than a few hundred megabytes (200 MB for a full year of the newspaper Le Monde); therefore, with 64GB RAM available on basically any desktop computer (even on my iPhone), parsing UTF32 texts is not a problem...

Michael: you are right, but let me re-aim my question: is it the case that, for all natural languages, any letter with all its diacritics (e.g. Hebrew shin + dagesh + dot), any ligature (oe), or any combined letter (e.g. Korean triplets) can be associated with an equivalent composite character, i.e. has one 4-byte Unicode scalar?

Do you consider Emoji to not be "natural language"? Because there are plenty of Emoji which decompose to more than 4 Unicode scalars:

let c = "👩‍👩‍👧‍👧"
print(Array(c)) // => ["👩‍👩‍👧‍👧"]
print(Array(c.unicodeScalars)) // => ["\u{0001F469}", "\u{200D}", "\u{0001F469}", "\u{200D}", "\u{0001F467}", "\u{200D}", "\u{0001F467}"]

There are also conjunct characters in scripts such as Devanagari which are considered one logical character, even though the Unicode rules for grapheme clusters split them out into different characters, IIRC:

let str = "ख्ज्ञ"
print(Array(str)) // => ["ख्", "ज्", "ञ"]
print(Array(str.unicodeScalars)) // => ["\u{0916}", "\u{094D}", "\u{091C}", "\u{094D}", "\u{091E}"]
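
For the parser use case raised above, one possible approach within the existing model is sketched below, using only the standard library; the variable names are illustrative. It shows that a precomposed and a decomposed spelling compare equal as Strings, that a multi-scalar cluster is still a single Character, and that copying into `[Character]` once gives constant-time integer indexing afterwards without forcing each element into one scalar.

```swift
let precomposed = "\u{E9}"        // é as a single scalar
let decomposed  = "e\u{301}"      // e followed by a combining acute accent

// Swift compares Strings by canonical equivalence, not scalar by scalar.
print(precomposed == decomposed)                       // true
print(precomposed.unicodeScalars.count,
      decomposed.unicodeScalars.count)                 // 1 2

// Seven scalars joined by zero-width joiners, still one Character.
let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F467}"
print(family.count, family.unicodeScalars.count)       // 1 7

// Pay the O(N) segmentation once, then index by integer in O(1).
let text = "a" + decomposed + family + "z"
let characters = Array(text)
print(characters.count)                                // 4
print(characters[3])                                   // z
```

The trade-off is memory: each `Character` element is larger than a UTF-32 code unit, so this suits texts that fit comfortably in RAM.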