Ted, that sort of implementation grows many common strings by a factor of 8 and makes some less common strings require multiple memory allocations. Considering that our research has shown it is a big performance and energy-use win to heroically compress <https://www.mikeash.com/pyblog/friday-qa-2012-07-27-lets-build-tagged-pointers.html> strings to avoid both kinds of bloat (plenty of actual data was gathered before tagged pointer strings were added to Cocoa), a scheme like the one you're proposing is pretty much a non-starter as far as I'm concerned.
···
Sent from my moss-covered three-handled family gradunza
On Feb 22, 2017, at 5:56 AM, Ted F.A. van Gaalen <tedvgiosdev@gmail.com> wrote:
Hi Ben,
thank you, yes, I know all that by now. I have seen that one goes to great lengths to optimise, not only for storage but also for speed. But how far does this need to go? In any case, optimisation should not be used
as an argument for restricting a PL's functionality, that is, for leaving out PL elements which are common and useful. I wouldn't worry so much over storage (unless one wants to load a complete book into memory...). In iOS, the average app is about 15-50 MB; String data is mostly a fraction of that. In macOS or similar I'd think it is even less significant...
I wonder how much performance and memory consumption would differ from the current contiguous memory implementation if a String were just a plain row of (references to) Character (extended grapheme cluster) objects, Array<Character>. That would simplify the basic logic and (sub)string handling significantly, because one then has direct access to the String's elements, using the reasonably fast access methods of a Swift Collection/Array.
I have experimented with an alternative String struct based upon Array<Character>, seeing how easy it was to implement most popular string handling functions as one can work with the Character array directly.
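A minimal sketch of what such an Array<Character>-backed type might look like, written in current Swift. The name CharacterArrayString and all of its members are hypothetical and are not taken from Ted's actual experiment; storageBytes is only a rough lower bound that ignores any heap allocations made by individual Characters.

```swift
// Hypothetical type for illustration only; not part of the standard library.
struct CharacterArrayString: CustomStringConvertible {
    private var storage: [Character]

    init(_ string: String) {
        // A String is a Collection of Character, so this copies every grapheme.
        storage = Array(string)
    }

    var count: Int { return storage.count }

    // Constant-time access by integer offset, which is the attraction of this layout.
    subscript(n: Int) -> Character {
        return storage[n]
    }

    subscript(r: Range<Int>) -> CharacterArrayString {
        return CharacterArrayString(String(storage[r]))
    }

    // Rough lower bound on inline storage; the per-Character size varies by Swift version.
    var storageBytes: Int {
        return storage.count * MemoryLayout<Character>.stride
    }

    var description: String { return String(storage) }
}

let s = CharacterArrayString("Koala, Snail")
print(s[0], s[7..<12], s.count)
print("inline bytes:", s.storageBytes, "vs UTF-8 bytes:", "Koala, Snail".utf8.count)
```

Comparing the two byte counts printed on the last line gives a feel for the storage growth that the replies in this thread object to.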
Currently at deep-dive-depth in the standard lib sources, especially String & Co.
Kind Regards
TedvG
On 21 Feb 2017, at 01:31, Ben Cohen <ben_cohen@apple.com> wrote:
Hi Ted,
While Character is the Element type for String, it would be unsuitable for a String's implementation to actually use Character for storage. Character is fairly large (currently 9 bytes), very little of which is used for most values. For unusual graphemes that require more storage, it allocates more memory on the heap. By contrast, String's actual storage is a buffer of 1- or 2-byte elements, and all graphemes (what we expose as Characters) are held in that contiguous memory no matter how many code points they comprise. When you iterate over the string, the graphemes are unpacked into a Character on the fly. This gives you the user interface of a collection that superficially resembles [Character], but that does not mean [Character] would be a workable implementation.
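To make the distinction between a Character and its underlying code units concrete, here is a small sketch in current Swift (which spells the Swift 3 `characters.count` as `count`). The sample strings are chosen purely for illustration, and the size printed on the last line depends on the Swift version in use, so no value is claimed for it here.

```swift
// Illustration only: one user-perceived Character can span several code units.
let samples = [
    "a",                       // a single ASCII letter
    "e\u{301}",                // "e" followed by a combining acute accent
    "\u{1F1F3}\u{1F1F1}"       // a flag built from two regional indicator scalars
]

for s in samples {
    print(s,
          "characters:", s.count,
          "scalars:", s.unicodeScalars.count,
          "UTF-16 units:", s.utf16.count,
          "UTF-8 bytes:", s.utf8.count)
}

// Size of one Character value; this figure varies between Swift versions.
print("MemoryLayout<Character>.size =", MemoryLayout<Character>.size)
```

Each sample is a single Character, yet the number of scalars and code units behind it differs, which is why the contiguous buffer has to be unpacked into Characters during iteration.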
On Feb 20, 2017, at 12:59 PM, Ted F.A. van Gaalen <tedvgiosdev@gmail.com> wrote:
Hi Ben, Dave (you should not read this now, you're on vacation :o) & Others
As described in the Swift Standard Library API Reference:
The Character type represents a character made up of one or more Unicode scalar values,
grouped by a Unicode boundary algorithm. Generally, a Character instance matches what
the reader of a string will perceive as a single character. The number of visible characters is
generally the most natural way to count the length of a string.
The smallest discrete unit we (app programmers) are mostly working with is this
perceived visible character, what else?
If that is the case, my reasoning is that Strings (could / should?) be relatively simple,
because most, if not all, complexity of Unicode is confined within the Character object and
completely hidden** for the average application programmer, who normally only needs
to work with Strings which contain these visible Characters, right?
It then makes no difference at all "what" is "in" the Character (excellent implementation btw)
(Unicode, ASCII, EBCDIC, Elvish, KlingonIV, IntergalacticV.2, whatever),
because we rely in sublime oblivion for the visual representation of whatever is in
the Character on miraculous font processors hidden in the dark depths of the OS.
Then, in this perspective, my question is: why is String not implemented as
directly based upon an array [Character]? In that case one can refer to the Characters of the
String directly, not only for direct subscripting but also for other String functionality, in an efficient way.
(I do have a scope of independent Swift here, that is, interaction with libraries should be
solved by the compiler, so as not to be restricted by legacy ObjC etc.)
** (except if one needs to e.g. access individual elements and/or compose graphics directly?
but for this purpose the Character's properties are accessible)
For the sake of convenience, based upon the above reasoning, I now "emulate" this in
a string extension, thereby ignoring the rare cases that a visible character could be based
upon more than a single Character (extended grapheme cluster). If that would occur,
they should be merged into one extended grapheme cluster, a single Character that is.
//: Playground - implement direct subscripting using a Character array
// Of course, if String were defined as a directly accessible array of Characters,
// this would be more efficient than in these extension functions.
// (Swift 3 syntax: `characters` is the String's view of its Characters.)
extension String
{
    var count: Int
    {
        get
        {
            return self.characters.count
        }
    }

    // Each subscript below copies all Characters into a new Array before indexing.
    subscript (n: Int) -> String
    {
        return String(Array(self.characters)[n])
    }

    subscript (r: Range<Int>) -> String
    {
        return String(Array(self.characters)[r])
    }

    subscript (r: ClosedRange<Int>) -> String
    {
        return String(Array(self.characters)[r])
    }
}

func test()
{
    let zoo = "Koala 🐨, Snail 🐌, Penguin 🐧, Dromedary 🐪"
    print("zoo has \(zoo.count) characters (discrete extended graphemes):")
    for i in 0..<zoo.count
    {
        print(i,zoo[i],separator: "=", terminator:" ")
    }
    print("\n")
    print(zoo[0..<7])
    print(zoo[9..<16])
    print(zoo[18...26])
    print(zoo[29...39])
    print("images:" + zoo[6] + zoo[15] + zoo[26] + zoo[39])
}

test()
this works as intended and generates the following output:
zoo has 40 characters (discrete extended graphemes):
0=K 1=o 2=a 3=l 4=a 5= 6=🐨 7=, 8= 9=S 10=n 11=a 12=i 13=l 14= 15=🐌 16=, 17=
18=P 19=e 20=n 21=g 22=u 23=i 24=n 25= 26=🐧 27=, 28= 29=D 30=r 31=o 32=m
33=e 34=d 35=a 36=r 37=y 38= 39=🐪
Koala 🐨
Snail 🐌
Penguin 🐧
Dromedary 🐪
images:🐨🐌🐧🐪
I don't know how (in)efficient this method is,
but in many cases this is not so important as e.g. with numerical computation.
I still fail to understand why direct subscripting strings would be unnecessary,
and would like to see this built-in in Swift asap.
Btw, I do share the concern as expressed by Rien regarding the increasing complexity of the language.
Kind Regards,
TedvG
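On the efficiency question raised above: the extension builds a complete Array of Characters on every subscript call. One possible alternative, sketched here in current Swift, walks the string's own indices instead, so no intermediate array is allocated. The labelled subscripts `offset:` and `offsets:` are hypothetical names chosen for this sketch and are not standard library API. Each call still costs time proportional to the offset, because grapheme boundaries have to be found by scanning from the start.

```swift
extension String {
    // Hypothetical labelled subscript: the Character at integer offset n.
    // Walks from startIndex, so it is O(n) per call but allocates no Array.
    subscript(offset n: Int) -> Character {
        return self[index(startIndex, offsetBy: n)]
    }

    // Hypothetical labelled subscript: the Characters in a half-open offset range.
    // Returns a Substring, which shares storage with the original String.
    subscript(offsets r: Range<Int>) -> Substring {
        let lower = index(startIndex, offsetBy: r.lowerBound)
        let upper = index(lower, offsetBy: r.count)
        return self[lower..<upper]
    }
}

let animals = "Koala, Snail, Penguin, Dromedary"
print(animals[offset: 0])
print(animals[offsets: 7..<12])
```

A loop that calls `animals[offset: i]` for every i remains quadratic overall; iterating the string directly with `for character in animals` visits each Character once.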