Swift String history

Hello there! I am not sure if this is the right section given that my question is about "past" development and not "future" development, so feel free to move it.

So, a little bit of context. I am preparing a presentation for my team at work about Unicode, encodings, and strings, and I am including a section about programming languages, the history of their string types, and their functionality. I am obviously including Swift, so I wanted to gain some more insight into how String and related types work now and how they used to work.

I have read this document about UTF-8 String, this one about UTF-8 Everywhere, and this one about String's internal details, and I have a good understanding of how String & co. work. What I'd like to understand better is why String's internals used to be Contiguous ASCII + Contiguous UTF-16 before Swift 5's switch to UTF-8, and why the switch was made in that particular time frame (if there's a specific reason).

Thank you very much!


Looking forward to seeing this presentation.

Can't help with those specific questions, but here are some suggestions for what you could write about.

Suggestions
  • In many languages strings are value types even if other types like arrays/dictionaries are reference types. Note that many languages silently assume value semantics for simple types like "int" and reference semantics for complex types like dictionaries, without stating this explicitly. (A short Swift sketch of this contrast follows this list.)

  • objc: In Objective-C, strings are reference types.

  • objc: As mutable strings are subtypes of immutable strings (and, generally, as mutable types are subtypes of immutable types), this opens its own can of worms:

  • objc: A seemingly immutable string can change underneath you:

void foo(NSString* string) {
    NSLog(@"%@", string); // "hello"
    [NSThread sleepForTimeInterval:2];
    NSLog(@"%@", string); // "world" 🤔
}

void fooTest(void) {
    NSMutableString* x = [[NSMutableString alloc] initWithString:@"hello"];
    NSThread* t = [[NSThread alloc] initWithBlock:^{
        foo(x);
    }];
    [t start];
    [NSThread sleepForTimeInterval:1];
    [x setString:@"world"];
    [NSThread sleepForTimeInterval:2];
}
  • objc: To mitigate such issues, strings used as dictionary keys are copied, so it is not a problem to create a dictionary entry with a mutable string and then change that string:
    NSMutableString* a = [[NSMutableString alloc] initWithString:@"a"];
    NSString* b = @"b";
    
    NSDictionary* dict = @{a: @"a", b: @"b"};
    NSLog(@"%@", dict); // "{ a = a; b = b; }"
    [a setString:@"b"]; // doesn't change the key above
    NSLog(@"%@", dict); // "{ a = a; b = b; }" 😀
  • objc: However, with NSSet that's still a problem:
    NSMutableString* c = [[NSMutableString alloc] initWithString:@"c"];
    NSString* d = @"d";
    
    NSMutableSet* s = [NSMutableSet new];
    [s addObject:c];
    [s addObject:d];
    NSLog(@"count: %d, %@", s.count, s); // "count: 2, {( c, d )}"
    [c setString:@"d"]; // oops
    NSLog(@"count: %d, %@", s.count, s); // "count: 2, {( d, d )}" 😢
  • objc: Note that this issue is not limited to NSString / NSMutableString; it applies to other mutable types as well.

  • Apple platforms: Issues related to HFS (and related file systems) storing strings in decomposed form. I believe you can still observe that when using URL APIs. Sometimes you don't care and it just works; sometimes you do care, because the issues surface. I don't have examples offhand, but I remember there were quite a few. It's a big topic. (A small normalization sketch also follows this list.)

  • thread safety (immutable NSString instances are thread-safe, NSMutableString is not, and neither are Swift's String / Array / Dictionary under concurrent mutation). Worth talking about the thread safety of strings in different languages.
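
To illustrate the first point, here is a minimal Swift sketch of value vs. reference semantics (the Box class is a made-up example type, not an API):

// String is a value type in Swift: assigning it logically copies it, so the kind of
// remote mutation shown in the Objective-C example above cannot happen.
var a = "hello"
var b = a
b += ", world"
print(a)   // "hello"
print(b)   // "hello, world"

// For contrast, reference semantics via a class.
final class Box {
    var value: String
    init(_ value: String) { self.value = value }
}
let x = Box("hello")
let y = x            // both names refer to the same object
y.value = "world"
print(x.value)       // "world"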
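
And a small sketch for the decomposed-form point, showing canonically equivalent but differently encoded strings (the literals are just examples):

import Foundation

let precomposed = "caf\u{E9}"     // "café" using U+00E9 (NFC, precomposed)
let decomposed  = "cafe\u{301}"   // "café" using "e" + U+0301 COMBINING ACUTE ACCENT (NFD, as HFS+ stores names)

print(precomposed == decomposed)                          // true:  Swift's == compares canonical equivalence
print(precomposed.utf8.count, decomposed.utf8.count)      // 5 6:   the underlying bytes differ
print((precomposed as NSString).isEqual(to: decomposed))  // false: NSString compares code units literally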

  1. String was originally contiguous ASCII + contiguous UTF-16 in order to be compatible with Objective-C's NSString implementation.

  2. String was changed to UTF-8 between Swift 4 and 5 because Swift 5 was the first version of Swift with a stable ABI on Apple's platforms; if it hadn't been switched to UTF-8 in that time frame, it would not have been possible to change later without breaking the ABI.


The 'background' section of utf8everywhere alludes to why UTF-16 was popular, not just in Apple's frameworks, but everywhere - in the Windows API, in languages such as JavaScript, C#, Java, etc.

Unicode didn't arrive complete and fully-formed in the early 1990s; it has grown (and continues to grow), and we have learned a lot about the problem space. When the first versions of the standard were released, software companies (such as Microsoft and NeXT) were eager to add support for it, and the idea was that we could essentially just expand the character size from 8 to 16 bits and otherwise keep fixed-width characters and random-access strings. That was UCS-2: fixed-width, 16-bit characters.

I find the summary document Unicode 88 and the Wiki page for Han unification to be interesting reading...

Nothing comes for free, and the price of Unicode's fixed-length 16-bit character code design is the twofold expansion of ASCII (or other 8-bit-based) text storage, as seen in the figure on the previous page. This initially repugnant consequence becomes a great deal more attractive once the alternative is considered.

The only alternative to fixed-length encoding is a variable-length scheme using some sort of flags to signal the length and interpretation of subsequent information units. Such schemes require flag-parsing overhead effort to be expended for every basic text operation, such as get next character, get previous character, truncate text, etc. Any number of variable-length encoding schemes are possible (this fact itself being a major drawback); several that have been implemented are described in a later section.

By contrast, a fixed-length encoding is flat-out simple, with all of the blessings attendant upon that virtue. The format is unambiguous, unique, and not susceptible to debate or revision. It is a logical consequence of the fundamental notion of character stream. Since it requires no flag parsing overhead, it makes all text operations easier to program, more reliable, and (mainly) faster.

Unicode 88

Anyway, it turns out that 16 bits weren't even close to being enough. Not only did Unicode massively underestimate the needs of CJK languages, there was also a need to catalogue historical texts (for instance, how else could you write a history book which uses those texts?), to allow for round-tripping with legacy encodings, and more.

And so Unicode invented the UTF-16 encoding, which reserves part of the 16-bit code space for special flag code units known as "surrogates". In UTF-16, a lone surrogate is not a valid character.

When it became increasingly clear that 2^16 characters would not suffice, IEEE introduced a larger 31-bit space and an encoding UCS-4 that would require 4 bytes per character. This was resisted by the Unicode Consortium, both because 4 bytes per character wasted a lot of memory and disk space, and because some manufacturers were already heavily invested in 2-byte-per-character technology. The UTF-16 encoding scheme was developed as a compromise and introduced with version 2.0 of the Unicode standard in July 1996.

In the UTF-16 encoding, code points less than 2^16 are encoded with a single 16-bit code unit equal to the numerical value of the code point, as in the older UCS-2. The newer code points greater than or equal to 2^16 are encoded by a compound value using two 16-bit code units. These two 16-bit code units are chosen from the UTF-16 surrogate range 0xD800–0xDFFF which had not previously been assigned to characters. Values in this range are not used as characters, and UTF-16 provides no legal way to code them as individual code points.

Wikipedia: UTF-16

In other words: UTF-16 is a variable-width encoding.
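
A quick Swift illustration, using a scalar outside the Basic Multilingual Plane:

let s = "🙂"                                   // U+1F642
print(s.unicodeScalars.count)                  // 1: one code point
print(s.utf16.count)                           // 2: a surrogate pair in UTF-16
print(s.utf16.map { String($0, radix: 16) })   // ["d83d", "de42"]
print(s.utf8.count)                            // 4: four bytes in UTF-8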

And that's really what it comes down to - UTF-16 was always a compatibility thing. UCS-2 was a mistake, but interfaces which worked in terms of 16-bit strings were built for it. It was necessary to retrofit a larger character space on to them somehow.

And ultimately, compatibility is also the reason why Swift's String initially chose UTF-16 as its native encoding - for compatibility with NSString, NeXTSTEP's UCS-2 String type.

And that's also why UTF-8 is better - if you're doing variable-width anyway, there's no point to 16-bit elements any more.
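
A rough Swift sketch of that trade-off (the example strings are arbitrary; UTF-16 code units are two bytes each):

let ascii = "Hello, Swift"
print(ascii.utf8.count, ascii.utf16.count * 2)   // 12 24: mostly-ASCII text doubles in size as UTF-16

let mixed = "héllo 🙂"
print(mixed.utf8.count, mixed.utf16.count * 2)   // 11 16: non-ASCII scalars cost more bytes per scalar
                                                 //        in UTF-8, but both encodings are variable-width anyway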


Thank you all for your responses and suggestions! I'll try to incorporate them into the presentation as I see fit.

So the initial design was all about Swift's compatibility with ObjC; let me try to summarise, to see if I understood. NSStrings that could provide contiguous UTF-16 memory could be bridged directly as an indirect UTF-16 Swift String, and those that could not were bridged using the opaque format. Moreover, native Unicode Swift String instances could use the UTF-16 large-string format. Hope this is right!
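
For what it's worth, the bridge itself is just a cast at the source level; which internal form backs the result is an implementation detail that changed between Swift 4 and 5. A minimal sketch (nothing here is specific to either representation):

import Foundation

let ns = NSString(string: "héllo")
let bridged = ns as String                // NSString -> String bridge
let back = bridged as NSString            // String -> NSString bridge
print(bridged.utf16.count, back.length)   // 5 5: both views expose UTF-16 code units regardless of storage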

I'll ask my boss if I can share it, but I think there will be no issues! However, please be warned that it is introductory-level and very simply written! That aside, I hope it may help shed some light on some common misconceptions about the subject (the very reason I decided to prepare it in the first place) and provide a summary of the most common aspects of strings in various programming languages!


Here is the presentation! I admit it's not the prettiest I've done 😅 but I hope it'll clarify some "weird" Unicode and programming-language points, like it did for me while researching!

The presentation is a slightly modified version of what I showed to my colleagues, with some inside jokes removed and some other things adapted. Please also note that this was an interactive presentation, where my colleagues could stop me and ask questions, so the slides alone don't contain everything that was said on the day. If something is not clear, please don't be afraid to ask!

Also, if you note something wrong or incorrect, please point it out!

Bosses being bosses, my company's logo is all over the presentation. To me it can be viewed simply as attribution, but if this violates any forum rules please let me know and we'll address it.
