Will Swift String's understanding of Characters remain stable?

hannesoid · April 10, 2024, 7:41am

If I have a String "0123" today and iterate over it (collecting the elements into an array), I get its Characters ["0", "1", "2", "3"]
…please imagine the example with some spicy unicode characters.

If I persist the String's UTF-8 binary representation and initialize a Swift String from it in a few years, will I get the same Characters (number of characters and their respective Unicode scalars)?

Or could this change depending on some changing unicode rules or bug fixes?

Karl · April 10, 2024, 11:16am

Unicode rules

If a String contains only scalars which are assigned in the standard library's version of Unicode, all of the properties you mention are indeed forward-stable.

To check whether a String contains any unassigned scalars:

"some string".unicodeScalars.contains { $0.properties.age == nil }

If a String does contain unassigned scalars, its interpretation may change when processed by a version of Unicode in which that scalar is assigned to something. Note that since the Swift standard library is a system library on Apple platforms, it is possible for a user to have different Unicode versions across their various devices (e.g. an iPhone running iOS X with Unicode X might cloud-sync data with an iPad running iPadOS Y with Unicode Y).

You can use the .age property to enforce a maximum Unicode version, if it is important for you to exclude characters which are too new and may not be reliably interpreted across all OS versions that you support.
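A minimal sketch of such a check, assuming a hypothetical helper named usesOnlyScalars(assignedBy:in:) (not a standard library API). It rejects any scalar that is unassigned in the standard library's Unicode data, or that was assigned after the given version:

```swift
// Hypothetical helper, for illustration only.
// Returns true if every scalar in `string` was assigned in Unicode
// `maxVersion` or earlier, according to the standard library's Unicode data.
func usesOnlyScalars(
    assignedBy maxVersion: (major: Int, minor: Int),
    in string: String
) -> Bool {
    return string.unicodeScalars.allSatisfy { scalar in
        // `age` is nil for scalars that are unassigned in the
        // standard library's version of Unicode.
        guard let age = scalar.properties.age else { return false }
        if age.major != maxVersion.major {
            return age.major < maxVersion.major
        }
        return age.minor <= maxVersion.minor
    }
}

let plain = "0123"
let emoji = "\u{1F600}" // 😀, first assigned in Unicode 6.1

// Enforce Unicode 6.0 as the maximum version.
let plainOK = usesOnlyScalars(assignedBy: (major: 6, minor: 0), in: plain)
let emojiOK = usesOnlyScalars(assignedBy: (major: 6, minor: 0), in: emoji)
print(plainOK, emojiOK)
```

The maximum version would be chosen as the oldest Unicode version among the OS releases being supported.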

This is a very specialised requirement and most applications do not need this.

Bug fixes

There are two possibilities to consider - whether any reworking of the standard library code will ever introduce a bug, and whether any bugs which we discover will be fixed.

Obviously nobody wants to introduce bugs, but we can't entirely rule out the possibility of them ever being introduced, either. The standard library algorithms are tested using Unicode's published test suites and other tests to try to ensure they are accurate, but those test suites are also not (and cannot be) exhaustive.

So if we discover a bug, would we consider fixing it even though the fix could change the properties of some existing strings? The standard library maintainers will give you the definitive word, but IMO yes, it should always be under consideration.

The alternative would be that we continued to ship an incorrect implementation in String for stability purposes, but of course we would discourage using it, and would have to introduce some kind of fixed String2 which we would encourage developers to use instead. As more bugs may be discovered in the future, we might further have to introduce String3, String4, etc. That's clearly not good, so I think we would consider the best path forward to be modifying String's behaviour in this hypothetical scenario, even if it means some properties change behaviour.

ole · April 10, 2024, 2:35pm

I don't think this is true without exception, or at least it wasn't true in the past. For example, Unicode 9.0 changed the grapheme cluster boundary rules for emoji flags.

The rule in Unicode 9.0 (still current in Unicode 15.1):

Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point.

The previous rule in Unicode 8.0 was:

Do not break between regional indicator symbols.

This changed the grapheme breaking rules for existing code points (the regional indicator symbols were introduced in Unicode 6.0).

For example, under Unicode 8.0 rules the string "🇦🇷🇯🇵" would be treated as a single Character consisting of four code points. Since Unicode 9.0, it is treated as two Characters with two code points each.
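As a rough illustration, the different views of that string can be compared directly. The variable names here are only for the example, and the string is spelled with scalar escapes so the four regional indicators are explicit:

```swift
// "🇦🇷🇯🇵" written as its four regional indicator scalars:
// U+1F1E6 (A), U+1F1F7 (R), U+1F1EF (J), U+1F1F5 (P).
let flags = "\u{1F1E6}\u{1F1F7}\u{1F1EF}\u{1F1F5}"

// The scalar and UTF-8 views do not depend on grapheme breaking rules.
let scalarCount = flags.unicodeScalars.count
let utf8Count = flags.utf8.count

// The Character count depends on the grapheme breaking rules
// implemented by the standard library in use.
let characterCount = flags.count

print(scalarCount, utf8Count, characterCount)
```

Only the last number is affected by a rule change like the one between Unicode 8.0 and 9.0.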

scanon · April 10, 2024, 2:47pm

Yes, if we discover bugs in our implementation of the Unicode spec, we will fix them (and have done so in the past--@Alejandro probably has an example or two handy). If we discovered that there were clients depending on the buggy behavior, we might consider adding affordances to allow them to continue "working".

hannesoid · April 17, 2024, 9:38am

Many thanks, everybody, for the helpful replies!

It helped us decide to do our data modelling at the UTF-8 byte level instead of at the Character level (in the context of a CRDT where individual elements have stable identifiers).
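A small sketch of why the byte level is a stable basis, assuming the persisted bytes are valid UTF-8 (the sample string and variable names are made up for illustration, and this is not the actual CRDT model):

```swift
// Sample text: "café " followed by a flag made of two regional indicators.
let original = "caf\u{E9} \u{1F1E6}\u{1F1F7}"

// What would be persisted: the raw UTF-8 bytes.
let stored: [UInt8] = Array(original.utf8)

// Later: rebuild the String from the persisted bytes.
// Note: String(decoding:as:) repairs invalid UTF-8, so this sketch
// assumes `stored` is valid, as it is here.
let restored = String(decoding: stored, as: UTF8.self)

// Bytes and scalars round-trip exactly.
let sameBytes = Array(restored.utf8) == stored
let sameScalars = restored.unicodeScalars.elementsEqual(original.unicodeScalars)

// The Character count is derived from the bytes at read time,
// so it is the one value that segmentation rules can influence.
let characterCount = restored.count

print(stored.count, sameBytes, sameScalars, characterCount)
```

Identifiers attached to byte offsets stay meaningful even if a later Unicode version groups those bytes into Characters differently.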