Strings in Swift 4

We know that the cumbersome complexity of current Swift String handling
and processing is caused by the fact that Unicode characters
are stored and processed as streams/arrays whose elements have
variable widths (1 to 4 bytes per character).

Because of that, direct subscripting of string elements, e.g. str[2..<18],
is not possible. Therefore it was, and still is, not implemented in Swift,
much to the unpleasant surprise of many new Swift programmers
coming from other languages, like me. They miss plain direct subscripting
so much that the first thing they do before using Swift intensively is to
implement the following (or similar) dreadful code for direct
subscripting and bury it deep in a String extension, once written,
hopefully never to be seen again, as in this example:

extension String {

    // O(n): must walk from startIndex to reach offset i.
    subscript(i: Int) -> String {
        guard i >= 0 && i < characters.count else { return "" }
        return String(self[index(startIndex, offsetBy: i)])
    }

    subscript(range: Range<Int>) -> String {
        let lower = index(startIndex, offsetBy: max(0, range.lowerBound),
                          limitedBy: endIndex) ?? endIndex
        let upper = index(lower, offsetBy: range.upperBound - range.lowerBound,
                          limitedBy: endIndex) ?? endIndex
        return substring(with: lower..<upper)
    }

    subscript(range: ClosedRange<Int>) -> String {
        let lower = index(startIndex, offsetBy: max(0, range.lowerBound),
                          limitedBy: endIndex) ?? endIndex
        let upper = index(lower, offsetBy: range.upperBound - range.lowerBound + 1,
                          limitedBy: endIndex) ?? endIndex
        return substring(with: lower..<upper)
    }
}
    
[splendid jolly good Earl Grey tea is now being served to help those flabbergasted to recover as quickly as possible.]

This rather indirect and clumsy way of working with string data exists because
(with the exception of UTF-32) the Unicode encodings are
variable-width (1 to 4 bytes per character), which
makes string handling in UTF-8 and UTF-16 complex and inefficient:
e.g. to isolate a substring it is necessary to traverse the string
sequentially instead of accessing it directly.
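For instance, even a simple slice needs index arithmetic today (a minimal sketch in current Swift; the string and the offsets are arbitrary examples):

```swift
let str = "Hello, Swift!"

// Indices must be derived by walking from startIndex;
// each step may consume a variable number of bytes.
let start = str.index(str.startIndex, offsetBy: 7)
let end = str.index(str.startIndex, offsetBy: 12)
let sub = String(str[start..<end])   // "Swift"
```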

However, that is not the case with UTF-32, because with UTF-32 encoding
each character has a fixed width and always occupies exactly 4 bytes (32 bits).
Ergo, the problem can be solved easily: the simple solution is to always,
and without exception, use UTF-32 encoding as Swift's internal
string format, because it contains only fixed-width Unicode characters.

Unicode strings in whatever UTF encoding, as read into the program, would
be converted automatically to 32-bit UTF-32 format. Note that explicit conversion,
e.g. back to UTF-8, could be specified or defaulted when writing strings to a
storage medium, URL, etc.
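Such conversions already exist in today's standard library; a sketch (the byte values below spell "héllo" in UTF-8):

```swift
// Decode UTF-8 bytes into a String (internally transcoded),
// then explicitly re-encode on the way out.
let utf8Bytes: [UInt8] = [0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F]
let decoded = String(decoding: utf8Bytes, as: UTF8.self)   // "héllo"

let backToUTF8 = Array(decoded.utf8)                 // the same six bytes
let utf32 = decoded.unicodeScalars.map { $0.value }  // 32-bit code units
```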

Possible, but IMHO not recommended: the current String system could be pushed
down and kept alive (e.g. as a type StringUTF8?) as a secondary alternative to
accommodate those who need to process very large quantities of text in core.

What y'all think?
Kind regards
TedvG
www.tedvg.com
www.ravelnotes.com


Those are not (user-perceived) Characters; they are Unicode scalar values (often called "characters" by the Unicode standard). Characters as defined in Swift (a.k.a. extended grapheme clusters) have no fixed-width encoding, and Unicode scalar values are an inappropriate unit for most string processing. Please read the manifesto for details.
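The difference is easy to demonstrate (a sketch; the counts follow Swift's default grapheme-cluster segmentation):

```swift
// One user-perceived Character built from two Unicode scalars:
// "e" (U+0065) followed by a combining acute accent (U+0301).
let e = "e\u{301}"   // renders as "é"

print(e.count)                 // 1 Character (grapheme cluster)
print(e.unicodeScalars.count)  // 2 scalars, i.e. 2 UTF-32 code units
```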

···

On Feb 5, 2017, at 2:57 PM, Ted F.A. van Gaalen <tedvgiosdev@gmail.com> wrote:

However, that is not the case with UTF-32, because with UTF-32 encoding
each character has a fixed width and always occupies exactly 4 bytes (32 bits).
Ergo, the problem can be solved easily: the simple solution is to always,
and without exception, use UTF-32 encoding as Swift's internal
string format, because it contains only fixed-width Unicode characters.



Hi Dave,
Oops! Yes, you’re right!
I did read again, more thoroughly, about Unicode
and how Unicode is handled within Swift...
(I should have done that before writing; sorry.)

Nevertheless:

How about this solution (if I am not making other omissions in my thinking again):
- Store the string as a collection of fixed-width 32-bit UTF-32 characters anyway.
- However, if the Unicode character is a grapheme cluster (2..n Unicode characters), then
store a pointer to a hidden child string containing the actual grapheme cluster, like so:

1: [UTF32, UTF32, UTF32, pointer, UTF32, UTF32, pointer, UTF32, UTF32]
                            |                      |
                            v                      v
2:                   [UTF32, UTF32]         [UTF32, UTF32, UTF32, ...]

whereby (1) is a String as seen by the programmer,
and (2) are hidden child strings, each containing a grapheme cluster.

To make the distinction between a “plain” single UTF-32 char and a grapheme cluster,
set the most significant bit of the 32-bit value to 1 and use the other 31 bits
as a pointer to another (hidden) String instance containing the grapheme cluster.
In this way one could then also make graphemes within graphemes,
but that is probably not desired? Another solution is to store the grapheme clusters
in a dedicated “grapheme pool”, containing the (unique, as in a Set) grapheme clusters
encountered whenever a Unicode string (in whatever format) is read in or defined at runtime.
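A minimal sketch of such a pool (entirely hypothetical; the type and method names are invented here, and the tag-bit packing into the most significant bit is omitted):

```swift
// Hypothetical grapheme-cluster pool: each distinct cluster is
// stored once; strings would hold the returned index instead.
struct GraphemePool {
    private var indexOf: [String: UInt32] = [:]
    private var clusters: [String] = []

    // Return the pool index for a cluster, interning it if new.
    mutating func intern(_ cluster: String) -> UInt32 {
        if let existing = indexOf[cluster] { return existing }
        let index = UInt32(clusters.count)
        clusters.append(cluster)
        indexOf[cluster] = index
        return index
    }

    func cluster(at index: UInt32) -> String {
        return clusters[Int(index)]
    }
}

var pool = GraphemePool()
let first = pool.intern("e\u{301}")
let again = pool.intern("e\u{301}")   // deduplicated: same index
```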

But then again, seeing how hard it is to recognise grapheme clusters in the first place…
I don’t know. Unicode is complicated.

Kind regards
TedvG.

www.tedvg.com
www.ravelnotes.com

···

On 6 Feb 2017, at 05:15, Dave Abrahams <dabrahams@apple.com> wrote:

On Feb 5, 2017, at 2:57 PM, Ted F.A. van Gaalen <tedvgiosdev@gmail.com> wrote:

However, that is not the case with UTF-32, because with UTF-32 encoding
each character has a fixed width and always occupies exactly 4 bytes (32 bits).
Ergo, the problem can be solved easily: the simple solution is to always,
and without exception, use UTF-32 encoding as Swift's internal
string format, because it contains only fixed-width Unicode characters.

Those are not (user-perceived) Characters; they are Unicode scalar values (often called "characters" by the Unicode standard). Characters as defined in Swift (a.k.a. extended grapheme clusters) have no fixed-width encoding, and Unicode scalar values are an inappropriate unit for most string processing. Please read the manifesto for details.


The random access would require a uniform layout, so a pointer and a scalar would need to be the same size. The above would work on a 32-bit platform with a tagged pointer, but would require a 64-bit slot for pointers on 64-bit systems like macOS and iOS.

Today when I need to do random access into a string, I convert it to an Array<Character>. Hardly efficient memory-wise, but efficient enough for random access.
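That workaround is a one-liner in the Swift 4 model, where String is a Collection of Character (a sketch; in Swift 3 it would be Array(str.characters)):

```swift
let str = "Grapheme test"
// O(n) time and extra memory up front...
let chars = Array(str)            // [Character]
// ...but O(1) integer subscripting afterwards.
let third = chars[2]              // "a"
let word = String(chars[0..<8])   // "Grapheme"
```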

-DW

···

On Feb 6, 2017, at 10:26 AM, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org> wrote:




The random access would require a uniform layout, so a pointer and
scalar would need to be the same size. The above would work with a 32
bit platform with a tagged pointer, but would require a 64-bit slot
for pointers on 64-bit systems like macOS and iOS.

It would also make String not efficiently interoperable with almost any
other system that processes strings including Foundation and ICU.

Today when I need to do random access into a string, I convert it to
an Array<Character>. Hardly efficient memory-wise, but efficient
enough for random access.

I'd be willing to bet almost anything that you never actually need to
do random access into a String :wink:

···

on Mon Feb 06 2017, David Waite <david-AT-alkaline-solutions.com> wrote:

On Feb 6, 2017, at 10:26 AM, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org> wrote:

--
-Dave



The random access would require a uniform layout, so a pointer and scalar would need to be the same size. The above would work with a 32 bit platform with a tagged pointer, but would require a 64-bit slot for pointers on 64-bit systems like macOS and iOS.

Yeah, I know that, but the “grapheme cluster pool” I am imagining
could be allocated at a certain predefined base address,
whereby the pointer I am referring to is just an offset from this base address.
If so, an address space of 2^30 (1,073,741,824) bytes = 1 GB would be available,
which is more than sufficient for just storing unique grapheme clusters
(of course, not taking into account other allocations and app limitations).

Today when I need to do random access into a string, I convert it to an Array<Character>. Hardly efficient memory-wise, but efficient enough for random access.

As a programmer, I just want to use String as-is, but with direct subscripting like str[12..<34]
and, if possible, also with open ranges like str[12...],
implemented natively in Swift.

Kind Regards
TedvG
www.tedvg.com
www.ravelnotes.com

···

On 6 Feb 2017, at 19:10, David Waite <david@alkaline-solutions.com> wrote:

On Feb 6, 2017, at 10:26 AM, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org> wrote:


Thanks. Seeing now all this being heavily intertwined with external libs (ICU etc.),
then yes: too much effort for too little (making fixed-width Unicode strings).
Why am I doing this? Unicode is a wasps’ nest. How do you survive, Dave? :o)

But I do use “random string access”, e.g. extracting substrings
with e.g. let part = str[3..<6],
with the help of the aforementioned String extension.

Arrgh. Great, make me a tea...

TedvG

···

On 6 Feb 2017, at 23:25, Dave Abrahams <dabrahams@apple.com> wrote:

on Mon Feb 06 2017, David Waite <david-AT-alkaline-solutions.com> wrote:

On Feb 6, 2017, at 10:26 AM, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org> wrote:


Yeah, I know that, but the “grapheme cluster pool” I am imagining
could be allocated at a certain predefined base address,
whereby the pointer I am referring to is just an offset from this base address.
If so, an address space of 2^30 (1,073,741,824) bytes = 1 GB would be available,
which is more than sufficient for just storing unique grapheme clusters
(of course, not taking into account other allocations and app limitations).

When it comes to fast access what’s most important is cache locality. DRAM is like 200x slower than L2 cache. Looping through some contiguous 16-bit integers is always going to beat the pants out of dereferencing pointers.

Today when I need to do random access into a string, I convert it to an Array<Character>. Hardly efficient memory-wise, but efficient enough for random access.

As a programmer, I just want to use String as-is, but with direct subscripting like str[12..<34]
and, if possible, also with open ranges like str[12...],
implemented natively in Swift.

Kind Regards
TedvG
www.tedvg.com
www.ravelnotes.com

-DW

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

It’s quite rare that you need to grab arbitrary parts of a String without knowing what is inside it. If you’re saying str[12..<34] - why 12, and why 34? Is 12 the length of some substring you know from earlier? In that case, you could find out how many CodeUnits it had, and use that information instead.

The new model will give you some form of efficient “random” access; the catch is that it’s not totally random. Looking for the next character boundary is necessarily linear, so the trick for large strings (>16K) is to make sure you remember the CodeUnit offsets of important character boundaries.
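One way to remember those boundaries (a sketch; it materializes every Character boundary once, trading memory for O(1) lookups afterwards):

```swift
let str = "The quick brown fox"

// O(n) once: collect every character-boundary index.
let boundaries = Array(str.indices)

// O(1) thereafter: integer positions map straight to indices.
let tenth = str[boundaries[10]]                         // "b"
let word = String(str[boundaries[4]..<boundaries[9]])   // "quick"
```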

- Karl

···

On 6 Feb 2017, at 19:29, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org> wrote:

On 6 Feb 2017, at 19:10, David Waite <david@alkaline-solutions.com> wrote:

On Feb 6, 2017, at 10:26 AM, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org> wrote:


When it comes to fast access what’s most important is cache locality. DRAM is like 200x slower than L2 cache. Looping through some contiguous 16-bit integers is always going to beat the pants out of dereferencing pointers.

Hi Karl,
That is of course hardware/processor dependent… and Swift runs on different target systems, isn’t it?


It’s quite rare that you need to grab arbitrary parts of a String without knowing what is inside it. If you’re saying str[12..<34] - why 12, and why 34? Is 12 the length of some substring you know from earlier? In that case, you could find out how many CodeUnits it had, and use that information instead.

For this example I have used constants, but normally these would be variables.

I’d say it is not so rare; these things are often used for all kinds of string parsing. There are many
examples to be found on the Internet.
TedvG

···

On 7 Feb 2017, at 05:42, Karl Wagner <razielim@gmail.com> wrote:

On 6 Feb 2017, at 19:29, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org> wrote:

On 6 Feb 2017, at 19:10, David Waite <david@alkaline-solutions.com> wrote:

On Feb 6, 2017, at 10:26 AM, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org> wrote:


When it comes to fast access what’s most important is cache
locality. DRAM is like 200x slower than L2 cache. Looping through
some contiguous 16-bit integers is always going to beat the pants
out of dereferencing pointers.

Hi Karl,
That is of course hardware/processor dependent… and Swift runs on different target systems, isn’t it?

Actually the basic calculus holds for any modern processor.

It’s quite rare that you need to grab arbitrary parts of a String
without knowing what is inside it. If you’re saying str[12..<34] -
why 12, and why 34? Is 12 the length of some substring you know from
earlier? In that case, you could find out how many CodeUnits it had,
and use that information instead.
For this example I have used constants, but normally these would be variables.

I’d say it is not so rare; these things are often used for all kinds of string parsing. There are many
examples to be found on the Internet.
TedvG

That proves nothing, though. The fact that people are using integers to
do this doesn't mean you need to use them, nor does it mean that you'll
get the right results from doing so. Typically, examples that use
integer constants with strings are wrong for some large proportion of
Unicode text.
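A small sketch of how integer offsets that look fine for ASCII drift once real Unicode appears (the strings here are arbitrary examples):

```swift
let ascii = "flag: US"
let emoji = "flag: 🇺🇸"

// Character counts and code-unit counts agree for ASCII only:
print(ascii.count, ascii.utf16.count)   // 8 8
print(emoji.count, emoji.utf16.count)   // 7 10
// An integer offset computed from code units would overshoot the
// flag, which is one Character but four UTF-16 code units.
```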

···

on Tue Feb 07 2017, "Ted F.A. van Gaalen" <tedvgiosdev-AT-gmail.com> wrote:

On 7 Feb 2017, at 05:42, Karl Wagner <razielim@gmail.com> wrote:

On 6 Feb 2017, at 19:29, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org> wrote:

--
-Dave



That proves nothing, though. The fact that people are using integers to
do this doesn't mean you need to use them, nor does it mean that you'll
get the right results from doing so. Typically, examples that use
integer constants with strings are wrong for some large proportion of
Unicode text.

This is all a bit confusing.
In https://en.wiktionary.org/wiki/glyph the definition of a glyph in our context is:
(typography, computing) A visual representation of a letter, character, or symbol, in a specific font and style.

I now assume that:
1. A “plain” Unicode character (code point?) can result in one glyph.
2. A grapheme cluster always results in just a single glyph, true?
3. The only things that I can see on screen or in print are glyphs (“carvings”, visual elements that stand on their own).
4. In this context, a glyph is a humanly recognisable visual form of a character.
5. On this level (the glyph, what I can see as a user) it is not relevant, and also not detectable, how many Unicode scalars (code points?) or graphemes, or even what kind of encoding, the glyph was based upon.

Is this correct? (especially 1 and 2)

Based on these assumptions, to me the definition of a character == glyph.
Therefore, my working model: I see a row of characters as a row of glyphs,
which are discrete autonomous visual elements, ergo:
each element is individually addressable with integers (ordinals).

?

TedvG

···

On 7 Feb 2017, at 19:44, Dave Abrahams <dabrahams@apple.com> wrote:
on Tue Feb 07 2017, "Ted F.A. van Gaalen" <tedvgiosdev-AT-gmail.com> wrote:

On 7 Feb 2017, at 05:42, Karl Wagner <razielim@gmail.com> wrote:

On 6 Feb 2017, at 19:29, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org> wrote:


While this is true, the encoding(s) have focused on memory size, operational performance, and compatibility with existing tools over the ability to perform integer-indexed random access. Even a Character in Swift that would be returned via random access is effectively a substring, due to the value being of arbitrary size.

In my experience, many languages and thus many developers use strings both as text and as data. This is even more common in scripting languages, where you don’t have nil-termination to contend with for storing binary data within a ‘string’.

I assume Swift has taken the standpoint that text cannot be handled properly or safely unless it is considered distinct from data, and thus the need for random access goes down significantly when String is being used properly.

Perhaps it would help if you provided a real-world example where String requires random access?

-DW

···

On Feb 7, 2017, at 1:19 PM, Ted F.A. van Gaalen <tedvgiosdev@gmail.com> wrote:


I now assume that:
1. A “plain” Unicode character (code point?) can result in one glyph.

What do you mean by “plain”? Characters in some Unicode scripts are by no means “plain”. They can affect (and be affected by) the characters around them; they can cause glyphs around them to rearrange or combine (like ligatures), or their visual representation (glyph) may float in the same space as an adjacent glyph (and seem to be part of the “host” glyph), etc. So the general relationship of a character and its corresponding glyph (if there is one) is complex and depends on context and surrounding characters.

2. A grapheme cluster always results in just a single glyph, true?

False.

3. The only things that I can see on screen or in print are glyphs (“carvings”, visual elements that stand on their own).

The visible effect might not be a visual shape. It may be, for example, the way the surrounding shapes change or rearrange.

4. In this context, a glyph is a humanly recognisable visual form of a character.

Not in a straightforward one-to-one fashion, not even in Latin/Roman script.

5. On this level (the glyph, what I can see as a user) it is not relevant, and also not detectable, how many Unicode scalars (code points?) or graphemes, or even what kind of encoding, the glyph was based upon.

False.

···

On Feb 7, 2017, at 12:19 PM, Ted F.A. van Gaalen via swift-evolution <swift-evolution@swift.org> wrote:

Hello Hooman,
That invalidates my assumptions; thanks for evaluating them.
It’s more complex than I thought.
Kind Regards,
Ted

···

On 8 Feb 2017, at 00:07, Hooman Mehr <hooman@mac.com> wrote:


I also wonder what folks are actually doing that require indexing into
strings. I would love to see some real world examples of what and why
indexing into a string is needed. Who is the end consumer of that string,
etc.

Do folks have some examples?

-Shawn

···

On Thu, Feb 9, 2017 at 6:56 AM Ted F.A. van Gaalen via swift-evolution < swift-evolution@swift.org> wrote:


_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

Big +1 for this. Real world use cases that are hard today would be great to see so we can make sure they are accounted for in the new API design.

The ideal situation is that there is a common pattern with many of them that can be accommodated through useful higher-level methods (maybe even on Collection) that avoid the need to mess around with indices.
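Several such higher-level, index-free operations already exist on String today; a sketch of a few (nothing below is new API, just current standard-library methods):

```swift
let s = "swift-evolution"
print(s.hasPrefix("swift"))        // true
print(s.prefix(5))                 // "swift"     (first 5 Characters, as a Substring)
print(s.suffix(9))                 // "evolution" (last 9 Characters)
print(s.split(separator: "-"))     // ["swift", "evolution"]
```

For many everyday tasks these cover the need without the caller ever touching a String.Index.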

···

On Feb 9, 2017, at 7:48 AM, Shawn Erickson via swift-evolution <swift-evolution@swift.org> wrote:


Hello Shawn
Just google with any programming language name and “string manipulation”
and you have enough reading for a week or so :o)
TedvG

···

On 9 Feb 2017, at 16:48, Shawn Erickson <shawnce@gmail.com> wrote:


Hello Shawn
Just google with any programming language name and “string manipulation”
and you have enough reading for a week or so :o)
TedvG

That truly doesn't answer the question. It's not, “why do people index
strings with integers when that's the only tool they are given for
decomposing strings?” It's, “what do you have to do with strings that's
hard in Swift *because* you can't index them with integers?”
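As one concrete data point for that question: a task commonly cited as “needing” integer subscripts, extracting the text between two delimiters, can be written against String.Index directly. A sketch (`firstIndex(of:)` is the current spelling; older Swift used `index(of:)`):

```swift
import Foundation  // String.range(of:) comes from Foundation

let line = "name=Jane;age=32"
// Every position here is a String.Index; no integer arithmetic is needed.
if let start = line.range(of: "name=")?.upperBound,
   let end = line[start...].firstIndex(of: ";") {  // a Substring shares indices with its base
    print(line[start..<end])  // Jane
}
```

Whether this is harder than the integer version, or merely unfamiliar, is the crux of the thread.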

···

on Thu Feb 09 2017, "Ted F.A. van Gaalen" <tedvgiosdev-AT-gmail.com> wrote:


--
-Dave


Hello Shawn
Just google with any programming language name and “string manipulation”
and you have enough reading for a week or so :o)
TedvG

That truly doesn't answer the question. It's not, “why do people index
strings with integers when that's the only tool they are given for
decomposing strings?” It's, “what do you have to do with strings that's
hard in Swift *because* you can't index them with integers?”

I have done some string processing. I have not encountered any algorithm where an integer index is absolutely needed, but sometimes it might be the most convenient.

For example, there are valid reasons to keep side tables that hold indexes into a string (such as maintaining attributes that apply to a substring, or pre-computed positions of soft line breaks). That does not require the index to be an integer, but maintaining the validity of those indexes after the string is mutated requires being able to offset them back or forth from some position on. These operations could be less verbose and easier if the index happened to be an integer or (efficiently) supported + and - operators. Also, I know there are other methods to deal with such things, and mutating a large string is generally a bad idea, but sometimes it is the easiest and most convenient solution to the problem at hand.
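The side-table idea above can be sketched by storing integer offsets externally and converting them to String.Index only when needed; the names (`marks`, `inserted`) are made up for illustration:

```swift
var text = "Hello, world"
var marks = [7]                       // hypothetical side table: Character offset of "world"

// Mutate the string, then shift the stored offsets to keep them valid.
let inserted = ">> "
text = inserted + text
marks = marks.map { $0 + inserted.count }

// Convert an integer offset back to a String.Index on demand.
let i = text.index(text.startIndex, offsetBy: marks[0])
print(text[i...])                     // world
```

Note that `index(_:offsetBy:)` is O(n) in the offset, which is exactly the cost the thread is debating.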

···

On Feb 9, 2017, at 3:11 PM, Dave Abrahams <dabrahams@apple.com> wrote: