Strings in Swift 4

Equally a non-starter. All known threadsafe schemes that require caches to be updated upon non-mutating operations have horrible performance issues, and further this would penalize all string code by reserving space for the cache and filling it even for the vast majority of operations that don't require random access. Trust me, we've gotten lots of such suggestions and thought through the implications of each one very carefully. I'm afraid you will have to accept being disappointed about this.

More generally, there's a reason that the collection model has bidirectional and random access distinctions: important data structures are inherently not random access. Heroic attempts to present the illusion that they are randomly-accessible are not going to fly. These abstractions always break down, leaking the true non-random-access nature in often unpredictable ways, penalizing lots of code for the sake of a very few use-cases, and introducing complexity that is hard for the optimizer to digest and makes it painful (sometimes impossible) to grow and evolve the library.

This should be seen as a general design philosophy: Swift presents abstractions that harmonize with, rather than hide, the true nature of things.

···

From me, the answer remains "no."

Sent from my moss-covered three-handled family gradunza

On Feb 22, 2017, at 1:40 PM, Ted F.A. van Gaalen <tedvgiosdev@gmail.com> wrote:

What about having a (lazy) Array<Character> property inside String?

Hi David W.
please read inline responses

Ted,

It might have helped if instead of being called String and Character, they were named Text and ExtendedGraphemeCluster.

Imho, “Text” maybe, but in computer programming “String” is more appropriate, I think. See:
https://en.wikipedia.org/wiki/String_(computer_science)

Also imho, “Character” is OK (but maybe “Symbol” would be better) because mostly, when working
with text/strings in an application it is not important to know how the Character is encoded,
e.g. Unicode, ASCII, whatever. (OOP: please hide the details, thank you)

However, if I needed to work with the character’s components directly, e.g. when I
might want to influence the display of
the underlying graphical aspects, I always have access to the Characters’ properties
and methods. Unicode codepoints, ASCII bytes.. whatever it contains...
   

They don’t really have the same behavior or functionality as string/characters in many other languages, especially older languages. This is because in many languages, strings are not just text but also random-access (possibly binary) data.

Could be, but that’s not my perception of a string.

That’s not to say that there aren’t a ton of algorithms where you can use Text like a String, treat ExtendedGraphemeCluster like a character, and get Unicode behavior without thinking about it.

But when it comes to random access and/or byte modification, you are better off working with something closer to a traditional (byte) string interface.

Trying to wedge random access and byte modification into the Swift String will simply complicate everything, slow down the algorithms which don’t need it, eat up more memory, as well as slow down bridging between Swift and Objective C code.

Yes, this has been extensively discussed in this thread...

Hence me suggesting earlier working with Data, [UInt8], or [Character] within the context of your manipulation code, then converting to a Swift String at the end. Convert to the data format you need, then convert back.
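As a rough illustration of that "convert, manipulate, convert back" approach, here is a minimal sketch. It assumes a current Swift toolchain (4 or later), where `Array(text)` yields `[Character]` directly; under Swift 3 the equivalent would be `Array(text.characters)`. The helper name `replacingCharacter` is made up for this example and is not a standard library API.

```swift
// Sketch: do the index-heavy work on [Character], then convert back.
// `replacingCharacter` is a hypothetical helper, not part of the standard library.
func replacingCharacter(in text: String, at offset: Int, with replacement: Character) -> String {
    var characters = Array(text)           // O(n) copy into a random-access array
    guard characters.indices.contains(offset) else { return text }
    characters[offset] = replacement       // O(1) integer-indexed mutation
    return String(characters)              // O(n) conversion back to String
}

let fixed = replacingCharacter(in: "caf\u{E9} au lait", at: 3, with: "e")
print(fixed) // cafe au lait
```

The cost of the two conversions is paid only by the code that asks for integer indexing, which is the point of doing it outside String.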

That’s exactly what I did, save that I have the desire to work exclusively with discrete
(in the model of humanly visible discrete elements on a graphical medium) ...

For the sake of completeness, here is my complete Swift 3.x playground example, which may be useful for others too:
//: Playground - noun: a place with Character!

import UIKit
import Foundation

struct TGString: CustomStringConvertible
{
    var ar = [Character]()
    
    var description: String // used by "print" and "\(...)"
    {
        return String(ar)
    }
    
    // Construct from a String
    init(_ str : String)
    {
        ar = Array(str.characters)
    }
    // Construct from a Character array
    init(_ tgs : [Character])
    {
        ar = tgs
    }
    // Construct from anything. well sort of..
    init(_ whatever : Any)
    {
        ar = Array("\(whatever)".characters)
    }
    
    var $: String
        {
        get // return as a normal Swift String
        {
            return String(ar)
        }
        set (str) //Mutable: set from a Swift String
        {
            ar = Array(str.characters)
        }
    }
    
    var asString: String
        {
        get // return as a normal Swift String
        {
            return String(ar)
        }
        set (str) //Mutable: set from a Swift String
        {
            ar = Array(str.characters)
        }
    }
    
    // Return the count of total number of characters:
    var count: Int
        {
        get
        {
            return ar.count
        }
    }
    
    // Return empty status:
    
    var isEmpty: Bool
        {
        get
        {
            return ar.isEmpty
        }
    }
    
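    // s[n]: a single character. Assigning an empty TGString removes it;
    // assigning a longer one inserts the remaining characters at n.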
    subscript (n: Int) -> TGString
        {
        get
        {
            return TGString( [ar[n]] )
        }
        set(newValue)
        {
            if newValue.isEmpty
            {
                ar.remove(at: n) // remove element when empty
            }
            else
            {
                ar[n] = newValue.ar[0]
                if newValue.count > 1
                {
                    insert(at: n, string: newValue[1..<newValue.count])
                }
            }
        }
    }

    subscript (r: Range<Int>) -> TGString
        {
        get
        {
            return TGString( Array(ar[r]) )
        }
        set(newValue)
        {
            ar[r] = ArraySlice(newValue.ar)
        }
    }

    subscript (r: ClosedRange<Int>) -> TGString
    {
        get
        {
            return TGString( Array(ar[r]) )
        }
        set(newValue)
        {
            ar[r] = ArraySlice(newValue.ar)
        }
    }

    func right( _ len: Int) -> TGString
    {
        var l = len
        
        if l > count
        {
            l = count
        }
        return TGString(Array(ar[count - l..<count]))
    }
    
    func left(_ len: Int) -> TGString
    {
        var l = len
        
        if l > count
        {
            l = count
        }
        return TGString(Array(ar[0..<l]))
    }
    
    func mid(_ pos: Int, _ len: Int) -> TGString
    {
        if pos >= count
        {
            return TGString.empty()
        }
        
        // Clamp the length so the slice never runs past the end of the array.
        var l = len
        
        if l > count - pos
        {
            l = count - pos
        }
        return TGString(Array(ar[pos..<pos + l]))
    }
    
    func mid(_ pos: Int) -> TGString
    {
        if pos >= count
        {
            return TGString.empty()
        }
        
        return TGString(Array(ar[pos..<count]))
    }
    
    mutating func insert(at: Int, string: TGString)
    {
        ar.insert(contentsOf: string.ar, at: at)
    }

    // Concatenate
    static func + (left: TGString, right: TGString) -> TGString
    {
        return(TGString(left.ar + right.ar) )
    }
    
    // Return an empty TGString:
    static func empty() -> TGString
    {
        return TGString([Character]())
    }
} // end TGString

// trivial, isn’t it? but effective...
var strabc = "abcdefghjiklmnopqrstuvwxyz"
var strABC = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
var abc = TGString(strabc)
var ABC = TGString(strABC)

func test()
{
    // as in Basic: left$, mid$, right$
    print(abc.left(5))
    print(abc.mid(5,10))
    print(ABC.mid(5))
    print(ABC.right(5))
    // ranges and concatenation:
    print(abc[12..<23])
    print(abc.left(5) + ABC.mid(6,6) + abc[10...25])
    
    // eat anything:
    let d:Double = -3.14159
    print(TGString(d))

    let n:Int = 1234
    print(TGString(n))
    
    print(TGString(1234.56789))
    
    let str = abc[15..<17].asString // Copy to a normal Swift String
    print(str)
    
    let s = "\(abc[12..<20])" // interpolate to normal Swift String.
    print(s)
    
    abc[3..<5] = TGString("34") // if lengths don't match:
    abc[8...9] = ABC[24...25] // length of dest. string is altered.
    abc[12] = TGString("$$$$") // if src l > 1 will insert remainder after dest.12 here
    abc[14] = TGString("") // empty removes character at pos.
    print(abc)
    abc.insert(at: 3, string: ABC[0..<3])
    print(abc)
}

test()

outputs this:
abcde
fghjiklmno
FGHIJKLMNOPQRSTUVWXYZ
VWXYZ
mnopqrstuvw
abcdeGHIJKLklmnopqrstuvwxyz
-3.14159
1234
1234.56789
pq
mnopqrst
abc34fghYZkl$$$nopqrstuvwxyz
abcABC34fghYZkl$$$nopqrstuvwxyz

That’s not to say that there aren’t features which would simplify/clarify algorithms working in this manner.

true.
This discussion was interesting, triggers further thinking,
maybe even more because it touched more principal considerations

As you know, of course, a programming language is always a compromise between human
and computer “The Machine” so to speak. It started years ago with writing
Assembler then came higher PLs like Fortran, PL/1 Cobol etc. later C and C++
to just name a few… Also we see deviations in directions like OOP FP..
(and everybody thinks they’re right of course even me :o)

What (even in this time (2017)) often seems to be an unavoidable obstacle
is the tradeoff/compromise speed/distance-from-the machine, that is,
how far optimisation aspects are emerging/surfacing through all these
layers of abstraction into the upper levels of the programming language...

In this view, the essence of this discussion was perhaps then not the triviality
of whether or not one should instantiate a character array, but rather that
obviously (not only) in Swift these underlying optimisation aspects more or
less form an undesired restriction… ?

TedvG.
1980 - from Yes song: "Machine Messiah" - read the lyrics also: very much in context here!

···

On 25 Feb 2017, at 07:26, David Waite <david@alkaline-solutions.com> wrote:

-DW

On Feb 24, 2017, at 4:27 PM, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org> wrote:

ok, I understand, thank you
TedvG

On 25 Feb 2017, at 00:25, David Sweeris <davesweeris@mac.com> wrote:

On Feb 24, 2017, at 13:41, Ted F.A. van Gaalen <tedvgiosdev@gmail.com> wrote:

Hi David & Dave

can you explain that in more detail?

Wouldn’t that turn simple character access into a mutating function?

assigning like s[11...14] = str is of course, yes.
only then - that is if the character array thus has been changed -
it has to update the string in storage, yes.

but str = s[n..<m] doesn’t mutate.
so you’d have to keep a (private) isChanged: Bool or bit,
or a checksum over the character array?

It mutates because the String has to instantiate the Array<Character> to which you're indexing into, if it doesn't already exist. It may not make any externally visible changes, but it's still a change.

- Dave Sweeris
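
To make the point about mutation concrete, here is a small sketch. `CachedText` is a hypothetical wrapper invented for this illustration; it is not how String is implemented. It only shows that filling a lazily created Array<Character> writes to the value, so the accessor has to be marked `mutating` and cannot be called on a `let` constant.

```swift
// Hypothetical wrapper illustrating the objection; not the real String layout.
struct CachedText {
    let storage: String
    private var characterCache: [Character]? = nil   // extra space in every value

    init(_ storage: String) { self.storage = storage }

    // Filling the cache writes to `self`, so even a "read" must be `mutating`.
    mutating func character(at offset: Int) -> Character {
        if characterCache == nil {
            characterCache = Array(storage)           // O(n) on first access
        }
        return characterCache![offset]
    }
}

var text = CachedText("swift")
let third = text.character(at: 2)
print(third) // i

// let constant = CachedText("swift")
// constant.character(at: 2)   // error: cannot use mutating member on immutable value
```

Hiding the write behind a class reference would remove the `mutating` keyword, but then concurrent readers would be racing on the cache, which is the thread-safety cost mentioned earlier in the thread.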

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

Ted,

It might have helped if instead of being called String and Character, they were named Text

I would oppose taking a good name like “Text” and using it for Strings which are mostly for machine processing purposes, but can be human-presentable with explicit locale. A name like Text would be a better fit for Strings bundled with locale etc. for the purpose of presentation to humans, which must always be in the context of some locale (even if a “default” system locale). Refer to the sections in the String manifesto[1][2]. Such a Text type is definitely out-of-scope for current discussion.

[1] https://github.com/apple/swift/blob/master/docs/StringManifesto.md#the-default-behavior-of-string
[2] https://github.com/apple/swift/blob/master/docs/StringManifesto.md#future-directions

and ExtendedGraphemeCluster.

What is expressed by Swift’s Character type is what the Unicode standard often refers to as a “user-perceived character”. Note that “character” by itself is not meaningful in Unicode (though it is often thrown about casually). In Swift, Character is an appropriate name here for the concept of a user-perceived character. If you want bytes, then you can use UInt8. If you want Unicode scalar values, you can use UnicodeScalar. If you want code units, you can use whatever that ends up looking like (probably an associated type named CodeUnit that is bound to UInt8 or UInt16 depending on the encoding).
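
The difference between these levels can be seen with a short sketch, assuming a current Swift toolchain and using only the standard String views. The sample string is an arbitrary choice: the letter "e" followed by a combining acute accent.

```swift
let text = "e\u{301}"   // "e" followed by U+0301 COMBINING ACUTE ACCENT

print(text.count)                  // 1: Characters (user-perceived characters)
print(text.unicodeScalars.count)   // 2: Unicode scalar values
print(text.utf16.count)            // 2: UTF-16 code units
print(text.utf8.count)             // 3: UTF-8 code units (bytes)
print(Array(text.utf8))            // [101, 204, 129]
```

One user-perceived character, two scalars, three bytes: which "length" is right depends entirely on which view the code is working with.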

···

Equally a non-starter. All known threadsafe schemes that require caches to be updated upon non-mutating operations have horrible performance issues, and further this would penalize all string code by reserving space for the cache and filling it even for the vast majority of operations that don't require random access.

Well, maybe “caching” is not the right description for what I've suggested.
It is more like:
  let all strings be stored as they are now, but as soon as you want to work with
random accessing parts of a string just “lift the string out of normal optimised string storage”
and then add (temporarily) a Character array so one can work with this array directly,
which implies that all other strings remain as they are. Ergo: efficiency
is only reduced for the “elevated” strings.
Using e.g. str.freeSpace(), if necessary, would then place the String back
in its normal storage domain, thereby disposing of the Character array
associated with it.
   

Trust me, we've gotten lots of such suggestions and thought through the implications of each one very carefully.

That’s good, because it means that a lot of people are interested in this subject and wish to help.
Of course you’ll get many suggestions that might not be very useful,
perhaps like this one... but sometimes suddenly someone
comes along with things that might never have occurred to you.
That is the beautiful nature of ideas…

I'm afraid you will have to accept being disappointed about this.

Well, like most developers, I am a stubborn kind of guy..
Luckily Swift is very flexible like Lego, so I rolled my own convenience struct.
If I need direct access on a string I simply copy the string to it.
it permits things like this: (and growing)

let strabc = "abcdefghjiklmnopqrstuvwxyz"
let strABC = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
var abc = TGString(strabc)
var ABC = TGString(strABC)

func test()
{
    // as in Basic: left$, mid$, right$
    print(abc.left(5))
    print(abc.mid(5,10))
    print(ABC.mid(5))
    print(ABC.right(5))

    // ranges and concatenation:
    print(abc[12..<23])
    print(abc.left(5) + ABC.mid(6,6) + abc[10...25])
    
    // eat anything:
    let d:Double = -3.14159
    print(TGString(d))

    let n:Int = 1234
    print(TGString(n))
    
    print(TGString(1234.56789))
    
    let str = abc[15..<17].asString // Copy to a normal Swift String
    print(str)
    
    let s = "\(abc[12..<20])" // interpolate to normal Swift String.
    print(s)
    
    abc[3..<5] = TGString("34") // if lengths don't match:
    abc[8...9] = ABC[24...25] // length of dest. string is altered.
    abc[12] = TGString("$$$$") // if src l > 1 will insert remainder after dest.12 here
    abc[14] = TGString("") // empty removes character at pos.
    print(abc)
    abc.insert(at: 3, string: ABC[0..<3])
    print(abc)
}

test()
.
outputs:
abcde
fghjiklmno
FGHIJKLMNOPQRSTUVWXYZ
VWXYZ
mnopqrstuvw
abcdeGHIJKLklmnopqrstuvwxyz
-3.14159
1234
1234.56789
abcdefghjiklmnopqrstuvwxyz
mnopqrst
abc34fghYZkl$$$$nopqrstuvwxyz
abcABC34fghYZkl$$$$nopqrstuvwxyz

kinda hoped that this could be builtin in Swift strings
Anyway, I’ve made myself what I wanted, which happily co-exists
alongside normal Swift strings. Performance and storage
aspects of my struct TGString are not very important, because
I am not using this on thousands of strings.
Simply want to use a string as a plain array, that’s all,
which is implemented in almost every PL on this planet.

More generally, there's a reason that the collection model has bidirectional and random access distinctions: important data structures are inherently not random access.

I don’t understand the above line: definition of “important data structures” <> “inherently”

Heroic attempts to present the illusion that they are randomly-accessible are not going to fly.

  ?? Accessing discrete elements directly in an array is not an illusion to me.
(e.g. I took the 4th and 7th eggs from the container)

These abstractions always break down, leaking the true non-random-access nature in often unpredictable ways, penalizing lots of code for the sake of a very few use-cases, and introducing complexity that is hard for the optimizer to digest and makes it painful (sometimes impossible) to grow and evolve the library.

Is an Array an abstraction? of what? I don’t get this either. most components in the real world can be accessed randomly.

This should be seen as a general design philosophy: Swift presents abstractions that harmonize with, rather than hide, the true nature of things.

The true nature of things is a very vague and subjective criterium, how can you harmonise with that, let alone with abstractions?
e.g. for me: “the true nature of things” for an array is that it has direct accessible discrete elements…

Sorry, with respect, we have a difference of opinion here.

Thanks btw for the link to this article about tagged pointers, very interesting.
it inspired me to (have) read other things in this domain as well.

TedvG

···

On 23 Feb 2017, at 02:24, Dave Abrahams <dabrahams@apple.com> wrote:

Ted,

It might have helped if instead of being called String and Character, they were named Text

I would oppose taking a good name like “Text” and using it for Strings which are mostly for machine processing purposes, but can be human-presentable with explicit locale. A name like Text would be a better fit for Strings bundled with locale etc. for the purpose of presentation to humans, which must always be in the context of some locale (even if a “default” system locale). Refer to the sections in the String manifesto[1][2]. Such a Text type is definitely out-of-scope for current discussion.

Oh, I would never propose such a naming change, because I am comfortable with the existing names. I’m just acknowledging that the history of string manipulation causes friction in developers coming from other languages, in that they may expect certain functionality which doesn’t make sense within String’s goals.

I was merely illustrating that there is a big difference between how strings work in traditional languages and how truly Unicode-safe strings work. In scripting languages like Ruby and Python, string bears the brunt of binary data handling. Even in languages like Java and C#, Unicode support takes compromises that Swift seems unwilling to make.

IMO, that Swift String doesn’t have random access capabilities is not a deficiency in Swift, but can cause misunderstandings of how Swift strings differ from other languages.

and ExtendedGraphemeCluster.

What is expressed by Swift’s Character type is what the Unicode standard often refers to as a “user-perceived character”. Note that “character” by itself is not meaningful in Unicode (though it is often thrown about casually). In Swift, Character is an appropriate name here for the concept of a user-perceived character. If you want bytes, then you can use UInt8. If you want Unicode scalar values, you can use UnicodeScalar. If you want code units, you can use whatever that ends up looking like (probably an associated type named CodeUnit that is bound to UInt8 or UInt16 depending on the encoding).

A character “char” in C or C++ is considered nearly universally to be an 8-bit value. A Character in Java or Char in C# is a 16-bit (UTF-16) value. All of these effectively behave as integer values (with Character in Java having the unique quality of being unsigned).

IMO, that Swift Character doesn’t behave as an integer value but rather closer to a string holding one user-perceived character is not a deficiency in Swift, but can cause misunderstandings because of how Swift differs from other languages.

-DW
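
A brief sketch of that difference, assuming a current Swift toolchain. The helper `successor(of:)` is hypothetical, written only for this illustration; the standard library offers no integer arithmetic on Character.

```swift
// Hypothetical helper: "add one" only makes sense for a single Unicode scalar value.
func successor(of character: Character) -> Character? {
    let scalars = character.unicodeScalars
    guard scalars.count == 1, let scalar = scalars.first,
          let next = Unicode.Scalar(scalar.value + 1) else { return nil }
    return Character(next)
}

let letter: Character = "a"
// let broken = letter + 1        // does not compile: Character is not an integer type
print(successor(of: letter) as Any)           // Optional("b")

// One user-perceived character built from two regional-indicator scalars (a flag).
let flag: Character = "\u{1F1F3}\u{1F1F1}"
print(flag.unicodeScalars.count)              // 2
print(successor(of: flag) == nil)             // true: no single integer value to increment
```

In C, Java, or C# the equivalent of `letter + 1` compiles because the character type is an integer; here the conversion to a scalar value has to be spelled out, and it can fail.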

···

On Feb 25, 2017, at 2:54 PM, Michael Ilseman <milseman@apple.com> wrote:

On Feb 25, 2017, at 3:26 PM, David Waite via swift-evolution <swift-evolution@swift.org> wrote:

Equally a non-starter. All known threadsafe schemes that require caches to be updated upon non-mutating operations have horrible performance issues, and further this would penalize all string code by reserving space for the cache and filling it even for the vast majority of operations that don't require random access.

Well, maybe “caching” is not the right description for what I've suggested.
It is more like:
  let all strings be stored as they are now, but as soon as you want to work with
random accessing parts of a string just “lift the string out of normal optimised string storage”
and then add (temporarily) a Character array so one can work with this array directly ”

That's a cache.

which implies that all other strings remain as they are. ergo: efficiency
is only reduced for the “elevated” strings,

You have to add that temporary array somewhere. The performance of every string is penalized for that storage, and also for the cost of throwing it out upon mutation. Every branch counts.

Using e.g. str.freeSpace(), if necessary, would then place the String back
in its normal storage domain, thereby disposing the Character array
associated with it.

Avoiding hidden dynamic storage overhead that needs to be freed is an explicit goal of the design (see the section on String and Substring).

Trust me, we've gotten lots of such suggestions and thought through the implications of each one very carefully.

That’s good, because it means, that a lot of people are interested in this subject and wish to help.
Of course you’ll get many of suggestions that might not be very useful,
perhaps like this one... but sometimes suddenly someone
comes along with things that might never have occurred to you.
That is the beautiful nature of ideas…

But at some point, I hope you'll understand, I also have to say that I think all the simple schemes have been adequately explored and the complex ones all seem to have this basic property of relying on caches, which has unacceptable performance, complexity, and, yes, usability costs. Analyzing and refuting each one in detail begins to be a waste of time after that. I'm not really willing to go further down this road unless someone has an implementation and experimental evidence that demonstrates it as non-problematic.

I'm afraid you will have to accept being disappointed about this.

Well, like most developers, I am a stubborn kind of guy..
Luckily Swift is very flexible like Lego, so I rolled my own convenience struct.
If I need direct access on a string I simply copy the string to it.
it permits things like this: (and growing)

let strabc = "abcdefghjiklmnopqrstuvwxyz"
let strABC = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
var abc = TGString(strabc)
var ABC = TGString(strABC)

func test()
{
    // as in Basic: left$, mid$, right$
    print(abc.left(5))
    print(abc.mid(5,10))
    print(ABC.mid(5))
    print(ABC.right(5))

    // ranges and concatenation:
    print(abc[12..<23])
    print(abc.left(5) + ABC.mid(6,6) + abc[10...25])
    
    // eat anything:
    let d:Double = -3.14159
    print(TGString(d))

    let n:Int = 1234
    print(TGString(n))
    
    print(TGString(1234.56789))
    
    let str = abc[15..<17].asString // Copy to a normal Swift String
    print(str)
    
    let s = "\(abc[12..<20])" // interpolate to normal Swift String.
    print(s)
    
    abc[3..<5] = TGString("34") // if lengths don't match:
    abc[8...9] = ABC[24...25] // length of dest. string is altered.
    abc[12] = TGString("$$$$") // if src l > 1 will insert remainder after dest.12 here
    abc[14] = TGString("") // empty removes character at pos.
    print(abc)
    abc.insert(at: 3, string: ABC[0..<3])
    print(abc)
}

test()
.
outputs:
abcde
fghjiklmno
FGHIJKLMNOPQRSTUVWXYZ
VWXYZ
mnopqrstuvw
abcdeGHIJKLklmnopqrstuvwxyz
-3.14159
1234
1234.56789
abcdefghjiklmnopqrstuvwxyz
mnopqrst
abc34fghYZkl$$$$nopqrstuvwxyz
abcABC34fghYZkl$$$$nopqrstuvwxyz

kinda hoped that this could be builtin in Swift strings
Anyway, I’ve made myself what I wanted, which happily co-exists
alongside normal Swift strings. Performance and storage
aspects of my struct TGString are not very important, because
I am not using this on thousands of strings.
Simply want to use a string as a plain array, that’s all,
which is implemented in almost every PL on this planet.

More generally, there's a reason that the collection model has bidirectional and random access distinctions: important data structures are inherently not random access.

I don’t understand the above line: definition of “important data structures” <> “inherently”

Important data structures are those from the classical CS literature upon which every practical programming language (and even modern CPU hardware) is based, e.g. hash tables. Based on the properties of modern string processing, strings fall into the same category. "Inherent" means that performance characteristics are tied to the structure of the data or problem being solved. You can't sort in better than O(N log N) worst case (mythical quantum computers don't count here), and that's been proven mathematically. Similarly it's easy to prove that the constraints of our problem mean that counting the characters in a string will always be O(N) worst case where N is the length of the representation. That means strings are inherently not random access.

Heroic attempts to present the illusion that they are randomly-accessible are not going to fly.

  ?? Accessing discrete elements directly

All collections have direct access via indices. You mean randomly, via arbitrary integers.

in an array is not an illusion to me.
(e.g. I took the 4th and 7th eggs from the container)

It's not an illusion when they're stored in an array.

If you have to walk down an aisle of differently sized cereal boxes to pick the 5th box of SuperBoomCrisp Flakes off the shelf in the grocery store, that's not random access (even if you're willing to drop the boxes into an array for later lookups as you go, as you're proposing). That's what Strings are like.
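
The distinction between direct access via indices and random access via arbitrary integers can be sketched like this, assuming a current Swift toolchain and using only standard String APIs. The sample word is an arbitrary choice.

```swift
let word = "naïve café"

// Direct access through an index the collection handed out.
if let accent = word.firstIndex(of: "ï") {
    print(word[accent])                       // ï
    print(word[word.index(after: accent)])    // v
}

// Access by integer offset has to walk from the start: O(n) in the offset.
let fifth = word.index(word.startIndex, offsetBy: 4)
print(word[fifth])                            // e

// Counting Characters walks the whole string as well.
print(word.count)                             // 10
```

The `index(_:offsetBy:)` call is the walk down the grocery aisle: it works, but its cost grows with the distance, and the API makes that visible instead of hiding it behind an integer subscript.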

These abstractions always break down, leaking the true non-random-access nature in often unpredictable ways, penalizing lots of code for the sake of a very few use-cases, and introducing complexity that is hard for the optimizer to digest and makes it painful (sometimes impossible) to grow and evolve the library.

Is an Array an abstraction? of what?

A randomly-accessible homogeneous tail growable collection. But the abstraction in question here isn't Array; it's RandomAccessCollection.
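
As a sketch of what that abstraction buys, assuming a current Swift toolchain: generic code can demand random access in its signature. The function `middleElement` is a hypothetical example, not a standard library API.

```swift
// Hypothetical helper that requires random access in its signature.
func middleElement<C: RandomAccessCollection>(of items: C) -> C.Element? {
    guard !items.isEmpty else { return nil }
    // RandomAccessCollection guarantees this offset is a constant-time operation.
    let middle = items.index(items.startIndex, offsetBy: items.count / 2)
    return items[middle]
}

let number = middleElement(of: [10, 20, 30])          // 20
// middleElement(of: "abc")    // does not compile: String is a BidirectionalCollection only
let character = middleElement(of: Array("abc"))       // "b", after an explicit O(n) copy
print(number as Any, character as Any)
```

Because String does not claim the conformance, an algorithm that silently assumes constant-time offsets is rejected at compile time, and the caller has to make the O(n) copy explicit.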

I don’t get this either. most components in the real world can be accessed randomly.

Actually that's far from being true in the real world. See the grocery store above. Computer memory is very unlike most things in the real world. Ask any roboticist.

This should be seen as a general design philosophy: Swift presents abstractions that harmonize with, rather than hide, the true nature of things.

The true nature of things is a very vague and subjective criterium,

Not at all; see my explanation above.

how can you harmonise with that, let alone with abstractions?
e.g. for me: “the true nature of things” for an array is that it has direct accessible discrete elements…

Arrays definitely support random access. Strings are not arrays.

Sorry, with respect, we have a difference of opinion here.

That's fine. Please don't be offended that I don't wish to argue it further. It's been an interesting exercise while I'm on vacation and I hoped it would lay out some general principles that would be useful to others in future even if you are not convinced, but when I get back to work next week I'll have to focus on other things.

···

Wouldn’t that turn simple character access into a mutating function?

- Dave Sweeris

···

Exactly.

···

Sent from my moss-covered three-handled family gradunza

On Feb 24, 2017, at 9:49 AM, David Sweeris <davesweeris@mac.com> wrote:

On Feb 23, 2017, at 4:04 PM, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org> wrote:

On 23 Feb 2017, at 02:24, Dave Abrahams <dabrahams@apple.com> wrote:

Equally a non-starter. All known threadsafe schemes that require caches to be updated upon non-mutating operations have horrible performance issues, and further this would penalize all string code by reserving space for the cache and filling it even for the vast majority of operations that don't require random access.

Well, maybe “caching” is not the right description for what I've suggested.
It is more like:
  let all strings be stored as they are now, but as soon as you want to work with
random accessing parts of a string just “lift the string out of normal optimised string storage”
and then add (temporarily) a Character array so one can work with this array directly ”
which implies that all other strings remain as they are. ergo: efficiency
is only reduced for the “elevated” strings,
Using e.g. str.freeSpace(), if necessary, would then place the String back
in its normal storage domain, thereby disposing the Character array
associated with it.

Wouldn’t that turn simple character access into a mutating function?

- Dave Sweeris

Hi Dave
Thanks for taking the time to go into this and explain.
This optimising goes much further than I thought.

That's fine. Please don't be offended that I don't wish to argue it further. It's been an interesting exercise while I'm on vacation and I hoped it would lay out some general principles that would be useful to others in future even if you are not convinced, but when I get back to work next week I'll have to focus on other things.

yes, I understand, it would become too iterative and time consuming I guess.
( how can you become work-detached if you keep doing things like this during vacation? )

Enjoy your vacation!
TedvG

···

On 24 Feb 2017, at 22:40, Dave Abrahams <dabrahams@apple.com> wrote:

Sent from my moss-covered three-handled family gradunza

On Feb 23, 2017, at 2:04 PM, Ted F.A. van Gaalen <tedvgiosdev@gmail.com <mailto:tedvgiosdev@gmail.com>> wrote:

On 23 Feb 2017, at 02:24, Dave Abrahams <dabrahams@apple.com <mailto:dabrahams@apple.com>> wrote:

Equally a non-starter. All known threadsafe schemes that require caches to be updated upon non-mutating operations have horrible performance issues, and further this would penalize all string code by reserving space for the cache and filling it even for the vast majority of operations that don't require random access.

Well, maybe “caching” is not the right description for what I've suggested.
It is more like:
  let all strings be stored as they are now, but as soon as you want to work with
random accessing parts of a string just “lift the string out of normal optimised string storage”
and then add (temporarily) a Character array so one can work with this array directly ”

That's a cache.

which implies that all other strings remain as they are. ergo: efficiency
is only reduced for the “elevated” strings,

You have to add that temporary array somewhere. The performance of every string is penalized for that storage, and also for the cost of throwing it out upon mutation. Every branch counts.

Using e.g. str.freeSpace(), if necessary, would then place the String back
in its normal storage domain, thereby disposing the Character array
associated with it.

Avoiding hidden dynamic storage overhead that needs to be freed is an explicit goal of the design (see the section on String and Substring).

Trust me, we've gotten lots of such suggestions and thought through the implications of each one very carefully.

That’s good, because it means, that a lot of people are interested in this subject and wish to help.
Of course you’ll get many suggestions that might not be very useful,
perhaps like this one... but sometimes suddenly someone
comes along with things that might never have occurred to you.
That is the beautiful nature of ideas…

But at some point, I hope you'll understand, I also have to say that I think all the simple schemes have been adequately explored and the complex ones all seem to have this basic property of relying on caches, which has unacceptable performance, complexity, and, yes, usability costs. Analyzing and refuting each one in detail begins to be a waste of time after that. I'm not really willing to go further down this road unless someone has an implementation and experimental evidence that demonstrates it as non-problematic.

I'm afraid you will have to accept being disappointed about this.

Well, like most developers, I am a stubborn kind of guy..
Luckily Swift is very flexible like Lego, so I rolled my own convenience struct.
If I need direct access on a string I simply copy the string to it.
it permits things like this: (and growing)

let strabc = "abcdefghjiklmnopqrstuvwxyz"
let strABC = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
var abc = TGString(strabc)
var ABC = TGString(strABC)

func test()
{
    // as in Basic: left$, mid$, right$
    print(abc.left(5))
    print(abc.mid(5,10))
    print(ABC.mid(5))
    print(ABC.right(5))

    // ranges and concatenation:
    print(abc[12..<23])
    print(abc.left(5) + ABC.mid(6,6) + abc[10...25])
    
    // eat anything:
    let d:Double = -3.14159
    print(TGString(d))

    let n:Int = 1234
    print(TGString(n))
    
    print(TGString(1234.56789))
    
    let str = abc[15..<17].asString // Copy to a normal Swift String
    print(str)
    
    let s = "\(abc[12..<20])" // interpolate to normal Swift String.
    print(s)
    
    abc[3..<5] = TGString("34") // if lengths don't match:
    abc[8...9] = ABC[24...25] // length of dest. string is altered.
    abc[12] = TGString("$$$$") // if src l > 1 will insert remainder after dest.12 here
    abc[14] = TGString("") // empty removes character at pos.
    print(abc)
    abc.insert(at: 3, string: ABC[0..<3])
    print(abc)
}

test()
outputs:
abcde
fghjiklmno
FGHIJKLMNOPQRSTUVWXYZ
VWXYZ
mnopqrstuvw
abcdeGHIJKLklmnopqrstuvwxyz
-3.14159
1234
1234.56789
abcdefghjiklmnopqrstuvwxyz
mnopqrst
abc34fghYZkl$$$$nopqrstuvwxyz
abcABC34fghYZkl$$$$nopqrstuvwxyz

kinda hoped that this could be builtin in Swift strings
Anyway, I’ve made myself what I wanted, which happily co-exists
alongside normal Swift strings. Performance and storage
aspects of my struct TGString are not very important, because
I am not using this on thousands of strings.
Simply want to use a string as a plain array, that’s all,
which is implemented in almost every PL on this planet.
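
The post does not show how TGString is implemented. As a rough sketch only, a wrapper of this kind could copy the string into a [Character] array once and then offer integer subscripts on top of it. The type name IndexedText and the clamping behaviour of left, mid and right are assumptions made for illustration, not the author's code:

```swift
// Hypothetical wrapper, not the TGString from the post: it copies a String
// into [Character] once (an O(n) pass) so later integer subscripts are O(1).
struct IndexedText: CustomStringConvertible {
    private var chars: [Character]

    init(_ string: String) { chars = Array(string) }

    var count: Int { chars.count }
    var description: String { String(chars) }

    // Single-character access by integer offset.
    subscript(i: Int) -> Character {
        get { chars[i] }
        set { chars[i] = newValue }
    }

    // Half-open integer range, returned as an ordinary String.
    subscript(range: Range<Int>) -> String {
        String(chars[range])
    }

    // BASIC-style helpers; out-of-range requests are clamped, not trapped.
    func left(_ n: Int) -> String {
        String(chars.prefix(max(0, n)))
    }
    func right(_ n: Int) -> String {
        String(chars.suffix(max(0, n)))
    }
    func mid(_ start: Int, _ length: Int) -> String {
        let lower = min(max(0, start), chars.count)
        let upper = min(lower + max(0, length), chars.count)
        return String(chars[lower..<upper])
    }
}

let text = IndexedText("abcdefghijklmnopqrstuvwxyz")
print(text.left(5))      // abcde
print(text.mid(5, 10))   // fghijklmno
print(text.right(5))     // vwxyz
print(text[12..<23])     // mnopqrstuvw
```

The cost is explicit here: the copy happens where the caller asks for it, and ordinary String values carry no extra storage.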

More generally, there's a reason that the collection model has bidirectional and random access distinctions: important data structures are inherently not random access.

I don’t understand the above line: definition of “important data structures” <> “inherently”

Important data structures are those from the classical CS literature upon which every practical programming language (and even modern CPU hardware) is based, e.g. hash tables. Based on the properties of modern string processing, strings fall into the same category. "Inherent" means that performance characteristics are tied to the structure of the data or problem being solved. You can't sort in better than O(N log N) worst case (mythical quantum computers don't count here), and that's been proven mathematically. Similarly it's easy to prove that the constraints of our problem mean that counting the characters in a string will always be O(N) worst case where N is the length of the representation. That means strings are inherently not random access.
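
To make the O(N) point concrete, here is a small sketch using only standard-library calls. The sample string is an arbitrary choice; it spells é as e plus a combining accent so that the Character count and the code-unit count differ:

```swift
let s = "cafe\u{301} au lait"   // "café au lait", with é as e + U+0301

// Every collection offers direct access through its own index type...
let start = s.startIndex
print(s[start])                              // c

// ...but reaching the n-th Character means walking the string: O(n).
let fourth = s.index(s.startIndex, offsetBy: 3)
print(s[fourth])                             // é
print(s[fourth].unicodeScalars.count)        // 2

// The Character count differs from the stored code-unit count, so an
// integer offset cannot be turned into a memory position by arithmetic.
print(s.count)                               // 12
print(s.utf8.count)                          // 14
```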

Heroic attempts to present the illusion that they are randomly-accessible are not going to fly.

  ?? Accessing discrete elements directly

All collections have direct access via indices. You mean randomly, via arbitrary integers.

in an array is not an illusion to me.
(e.g. I took the 4th and 7th eggs from the container)

It's not an illusion when they're stored in an array.

If you have to walk down an aisle of differently sized cereal boxes to pick the 5th box of SuperBoomCrisp Flakes off the shelf in the grocery store, that's not random access (even if you're willing to drop the boxes into an array for later lookups as you go, as you're proposing). That's what Strings are like.

These abstractions always break down, leaking the true non-random-access nature in often unpredictable ways, penalizing lots of code for the sake of a very few use-cases, and introducing complexity that is hard for the optimizer to digest and makes it painful (sometimes impossible) to grow and evolve the library.

Is an Array an abstraction? of what?

A randomly-accessible homogeneous tail growable collection. But the abstraction in question here isn't Array; it's RandomAccessCollection.
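
A short sketch of what the RandomAccessCollection abstraction buys. The helper middle(of:) is a made-up name for illustration; the protocol and the index calls are standard library:

```swift
// Compiles only for collections that promise O(1) index arithmetic.
func middle<C: RandomAccessCollection>(of items: C) -> C.Element? {
    guard !items.isEmpty else { return nil }
    let i = items.index(items.startIndex, offsetBy: items.count / 2)
    return items[i]
}

let word = "naïve"
// middle(of: word)   // rejected at compile time: String is only a BidirectionalCollection

// An explicit O(n) copy into an array, then O(1) access.
if let m = middle(of: Array(word)) {
    print(m)   // ï
}
```

The constraint turns the performance promise into something the compiler checks, so an algorithm that needs random access cannot silently run on a String.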

I don’t get this either. most components in the real world can be accessed randomly.

Actually that's far from being true in the real world. See the grocery store above. Computer memory is very unlike most things in the real world. Ask any roboticist.

This should be seen as a general design philosophy: Swift presents abstractions that harmonize with, rather than hide, the true nature of things.

The true nature of things is a very vague and subjective criterion,

Not at all; see my explanation above.

how can you harmonise with that, let alone with abstractions?
e.g. for me: “the true nature of things” for an array is that it has direct accessible discrete elements…

Arrays definitely support random access. Strings are not arrays.

Sorry, with respect, we have a difference of opinion here.

That's fine. Please don't be offended that I don't wish to argue it further. It's been an interesting exercise while I'm on vacation and I hoped it would lay out some general principles that would be useful to others in future even if you are not convinced, but when I get back to work next week I'll have to focus on other things.

Thanks btw for the link to this article about tagged pointers, very interesting.
it inspired me to (have) read other things in this domain as well.

TedvG

Hi David & Dave

can you explain that in more detail?

Wouldn’t that turn simple character access into a mutating function?

assigning like s[11…14] = str is of course, yes.
only then - that is if the character array thus has been changed -
it has to update the string in storage, yes.

but str = s[n..<m] doesn’t mutate.
So you’d have to keep a (private) isChanged: Bool or bit,
or a checksum over the character array?

Kind Regards
TedvG

···

Newbie question (sorry!):

Is it true or false that any grapheme cluster can be translated into a composite character? I understand this is certainly true at least for surrogate pairs, which can be translated into one code point (i.e. one UTF32 character).

If that is true, we could transparently convert any text file (whether UTF32, UTF16, UTF8 or UTF7, LOW or HIGH) into a series of UTF32 composite characters: we would then gain random access of characters in strings (essential for NLP parsers), as well as reversibility data <=> string <=> array of characters.

I’m almost sure it can’t be true, else Unicode variation selectors would be pointless.

Moreover, using UTF-32 for memory representation is a major waste of space, especially for parsers that need to be able to stream the content.

To get random access, you would have to keep the whole converted array in memory.

It is false.

:woman_zombie: is:

  • 1 grapheme (Swift's Character)
  • 4 scalars (Swift's Unicode.Scalar):
    • U+1F9DF, U+200D, U+2640, U+FE0F
  • 5 UTF-16 code units (Swift's String.UTF16View.Element):
    • D83E DDDF 200D 2640 FE0F
  • 13 UTF-8 code units (Swift's String.UTF8View.Element):
    • F0 9F A7 9F E2 80 8D E2 99 80 EF B8 8F
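
These counts can be checked from Swift itself. The sketch below builds the string from the four scalars listed above, so it does not depend on how a forum or editor renders the emoji. UTF-8 needs 4 + 3 + 3 + 3 = 13 bytes for these scalars. The Character count should print 1 on a toolchain whose grapheme-breaking rules know about emoji ZWJ sequences, which is an assumption about the Swift version in use:

```swift
// Woman zombie, spelled out scalar by scalar.
let zombie = "\u{1F9DF}\u{200D}\u{2640}\u{FE0F}"

print(zombie.count)                  // Characters (grapheme clusters)
print(zombie.unicodeScalars.count)   // 4
print(zombie.utf16.count)            // 5
print(zombie.utf8.count)             // 13

// Hex dumps of the code units in each encoding.
print(zombie.utf16.map { String($0, radix: 16, uppercase: true) })
print(zombie.utf8.map { String($0, radix: 16, uppercase: true) })
```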

edit: Bah, Discourse doesn't have support for the latest emoji! This cannot stand! For reference, here is the emoji: 🧟‍♀️ Woman Zombie Emoji

I am being stubborn…

Jean-Daniel: I need to parse texts in over 30 languages, including Arabic, Chinese, etc. My parser needs constant-time access to any character, even if this character has one, two, or more diacritics (for instance, in Arabic and Hebrew texts, vowels are almost always absent). On the other hand, the size of the texts to parse is seldom over a few hundred megabytes (200 MB for a full year of the newspaper Le Monde); therefore, with 64GB RAM available on basically any desktop computer (even on my iPhone), parsing UTF32 texts is not a problem…

Michael: you are right, but let me re-aim my question: is it the case that for all natural languages, any letter with all its diacritics (e.g. Hebrew shin + dagesh + dot), or ligatures (oe), or combined letters (e.g. Korean triplets) can be associated with an equivalent composite character, i.e. has one 4-byte Unicode scalar?

Do you consider Emoji to not be “natural language”? Because there are plenty of Emoji which decompose to more than 4 Unicode scalars:

let c = "👩‍👩‍👧‍👧"
print(Array(c)) // => ["👩‍👩‍👧‍👧"]
print(Array(c.unicodeScalars)) // => ["\u{0001F469}", "\u{200D}", "\u{0001F469}", "\u{200D}", "\u{0001F467}", "\u{200D}", "\u{0001F467}"]

There are also conjunct characters in scripts such as Devanagari which are considered one logical character, even though the Unicode rules for grapheme clusters split them out into different characters, IIRC:

let str = "ख्ज्ञ"
print(Array(str)) // => ["ख्", "ज्", "ञ"]
print(Array(str.unicodeScalars)) // => ["\u{0916}", "\u{094D}", "\u{091C}", "\u{094D}", "\u{091E}"]

That is incorrect. There are many natural languages that rely heavily on combining characters for which there is no precomposed form. Off the top of my head, I remember Devanagari, Farsi (I think), Tamil, and even Vietnamese (which is Latin-based) were examples.

My understanding was that:

– Vietnamese NFKC normalization’s purpose is exactly to recompose any Vietnamese decomposed sequence of scalars into the equivalent composite character (i.e. one 4-byte scalar), e.g. “a” + circumflex + tilde = U+1EAB

– Devanagari’s scripts combine two or three actual letters into one glyph, but from a linguistic point of view, the resulting glyph still represents the sequence of the initial letters, similar to ligatures in Latin languages, e.g. “ﬃ” = “f” + “f” + “i” (diﬃcult = difficult)

– Farsi (and most Arabic-based scripts) do have ligatures that must be processed as one logical unit, but these units function like words (rather than letters), similar to Latin-based abbreviations such as “&” = Latin “et”.

– Same thing with Emoji: I believe linguistic parsers should consider emoji as words rather than letters, e.g. “:heart:️” = “Noun heart” or “Verb to love”, “:grinning:” = “Adjective happy” or “Adverb Happily”, etc. They look like abbreviations, like “$” = “dollar”
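
The Vietnamese and ligature points in the list above can be tried out with Foundation, whose precomposedStringWithCompatibilityMapping property applies NFKC. This is only a sketch of those two cases; it says nothing about scripts where no precomposed form exists:

```swift
import Foundation

// "a" + COMBINING CIRCUMFLEX ACCENT + COMBINING TILDE
let decomposed = "a\u{0302}\u{0303}"
// Foundation's name for NFKC normalization.
let composed = decomposed.precomposedStringWithCompatibilityMapping

print(decomposed.unicodeScalars.count)   // 3
print(composed.unicodeScalars.count)     // 1
print(composed.unicodeScalars.map { String($0.value, radix: 16, uppercase: true) })  // ["1EAB"]

// String equality already uses canonical equivalence.
print(decomposed == composed)            // true

// NFKC also splits compatibility ligatures back into letters.
let ligature = "di\u{FB03}cult"
print(ligature.precomposedStringWithCompatibilityMapping)   // difficult
```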

I need to do more research…

It sounds like what you want is to operate directly on the unicode scalars instead of graphemes. String has a UnicodeScalarView, which lazily decodes the contents for this purpose. If you want to make it eager, you can say Array(myStr.unicodeScalars).
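
For example, a minimal sketch with the Hebrew shin + dagesh + shin dot case mentioned earlier (the sample string is an arbitrary choice):

```swift
// Shin + dagesh + shin dot, followed by two plain letters.
let myStr = "\u{05E9}\u{05BC}\u{05C1}\u{05DC}\u{05DD}"

let lazyScalars = myStr.unicodeScalars   // decoded on demand
let scalars = Array(lazyScalars)         // eager copy: one O(n) pass

print(myStr.count)       // 3 Characters
print(scalars.count)     // 5 scalars

// Integer subscripting is O(1) on the array.
print(String(scalars[3].value, radix: 16, uppercase: true))   // 5DC
```

The array holds one 4-byte Unicode.Scalar per element, so this is the UTF-32 representation asked for above, paid for only by code that needs it.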

If you need control concerning normalization, that could be added. If so, feel free to start a pitch for this functionality.

Thanks Michael, itaiferber, and Jean-Daniel,

From your feedback, I understand that it would be possible to get a “composition” operation that follows the NFKC standard to get an array of unicode scalars for a given string. This would allow one to build a robust syntactic parser that could parse texts in any natural language. Ligatures would not be a linguistic problem because NFKC properly separates them into sequences of letters; from a linguistic point of view, emojis should be processed as words rather than letters.

From a previous discussion with Michael, I understand that a standard “decomposition” operation such as NFKD could be available as well. Such a feature would allow one to build a robust morphological parser, that could add or remove accents or stresses to/from a letter in a linguistically natural way. That is crucial for all languages that have a heavy morphology.

Another missing feature that is desperately needed: a data <=> string reversibility access that would allow a parser to tell its client at what exact position in the initial data (byte array) the match actually occurred. Without this basic feature, I don’t see how to build serious NLP applications.
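
Part of this is arguably already reachable, because a String.Index is shared between the Character view and the UTF-8 view. The sketch below assumes the original data was valid UTF-8 and was decoded without repair or normalization; under any other assumption the byte offsets would not line up with the source bytes:

```swift
let source = "Le Monde, \u{E9}t\u{E9} 2017"   // "Le Monde, été 2017"
let data = Array(source.utf8)                  // stand-in for the original byte array

if let hit = source.firstIndex(of: "t") {
    // The same index measured in Characters and in UTF-8 bytes.
    let charOffset = source.distance(from: source.startIndex, to: hit)
    let byteOffset = source.utf8.distance(from: source.utf8.startIndex, to: hit)
    print(charOffset)   // 11
    print(byteOffset)   // 12
    print(data[byteOffset] == UInt8(ascii: "t"))   // true
}
```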

I don’t know how to “start a pitch” to get these three features; are “pitches” an official Apple procedure? I understand that the String team has other priorities and limited human resources (maybe I can help?). I do hope that Swift will get these features, hopefully before June, when I will have to start working full time on the next version of the NooJ linguistic platform.

Since this is an old thread (pre-forum) and likely to involve more discussion around specific use cases, I spun off a new thread: String for linguistic processing.