Swift.org blog: UTF-8 String

(Ted Kremenek) #1

There is a great new blog post titled "UTF-8 String", authored by @Michael_Ilseman. It talks about the change to String in Swift 5 to move to using UTF-8 as its Unicode encoding strategy.

Please use this forum thread to post questions related to that blog post.

19 Likes
#2

The blog post says:

But in the table of examples it shows:

いろは
Scalars U+3044 U+308D U+306F
UTF-8 E3 81 84 E3 82 8D E3 81 AF
UTF-16 44 30 8D 30 6F 30

So if I understand correctly, either the wording or the percentage should be changed.

• • •

Also, unrelated to that, and rather pedantically, I would suggest changing “comprised of” to either “comprising” or “composed of”.

(Jordan Rose) #3

The size comparisons are always using UTF-8 as the baseline. So here the ratio is 9 UTF-8 bytes / 6 UTF-16 bytes = 3/2 --> 50% more, and for ASCII the ratio is 2 UTF-8 bytes / 4 UTF-16 bytes = 1/2 --> 50% less.
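As a quick check of those byte counts, here is a minimal Swift sketch. The helper name `encodedSizes` is hypothetical and not from the blog post; it assumes each UTF-16 code unit occupies 2 bytes and ignores any storage overhead beyond the encoded content itself.

```swift
// Hypothetical helper: bytes needed to encode `s` in UTF-8 and in UTF-16.
// Each UTF-16 code unit is a UInt16, i.e. 2 bytes.
func encodedSizes(_ s: String) -> (utf8: Int, utf16: Int) {
    return (utf8: s.utf8.count,
            utf16: s.utf16.count * MemoryLayout<UInt16>.size)
}

// The three latter-BMP scalars from the table: U+3044 U+308D U+306F.
let kana = encodedSizes("いろは")

// A two-character ASCII string, matching the 2-vs-4 byte ASCII ratio above.
let ascii = encodedSizes("ab")

print("kana:  \(kana.utf8) UTF-8 bytes, \(kana.utf16) UTF-16 bytes")
print("ascii: \(ascii.utf8) UTF-8 bytes, \(ascii.utf16) UTF-16 bytes")
```

This only measures encoded content size; it says nothing about how `String` actually lays out its storage.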

3 Likes
#4

The example shows UTF-16 using 6 bytes where UTF-8 uses 9. Thus UTF-16 uses 33⅓% less memory than UTF-8 in that example, and UTF-8 uses 50% more than UTF-16.

The current wording says “UTF-16 uses 50% less memory than UTF-8”. But in the example, the amount of memory that UTF-8 uses is 9 bytes, and 50% of 9 is 4½. So unless the UTF-16 encoding is actually using 4½ bytes, the statement in the blog post is factually incorrect.
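The asymmetry between "50% more" and "33⅓% less" comes from which size is used as the baseline. A small sketch of the arithmetic, where `percentChange` is a hypothetical helper and the byte counts are the ones from the example table:

```swift
// Hypothetical helper: percent by which `size` differs from `baseline`.
// A positive result means larger than the baseline, negative means smaller.
func percentChange(of size: Int, relativeTo baseline: Int) -> Double {
    return Double(size - baseline) / Double(baseline) * 100
}

let utf8Bytes = 9   // E3 81 84 E3 82 8D E3 81 AF
let utf16Bytes = 6  // 44 30 8D 30 6F 30

// UTF-8 measured against UTF-16: (9 - 6) / 6, i.e. 50% more.
let utf8VsUtf16 = percentChange(of: utf8Bytes, relativeTo: utf16Bytes)

// UTF-16 measured against UTF-8: (6 - 9) / 9, i.e. about 33.3% less.
let utf16VsUtf8 = percentChange(of: utf16Bytes, relativeTo: utf8Bytes)

print(utf8VsUtf16, utf16VsUtf8)
```

Both numbers describe the same 9-versus-6 comparison; only the denominator changes.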

(Jon Shier) #5

It would be great if the blog post could show the benchmark results it mentions. Only the NSString bridging results are linked, and there were a lot of wins here.

(Chris Comeau) #6

Great article, thanks!

Just wanted to report a small typo in the article:

which can expensive

seems to be missing a "be".

1 Like
(Michael Ilseman) #7

Good catch! The post has been updated.

Right, the Objective-C interop improvements mostly landed in a single PR, so it’s easy to link to. UTF-8 string landed with significant wins, but also with some regressions that were fixed in later PRs, followed by even more optimizations. Since it was replacing a significant portion of the code in the stdlib, it was important to land early. The best we can do is either pick from various PRs (apples-to-apples), or compare across releases. Comparing across releases is less accurate, as a release contains many changes that can perturb the results, for example exclusivity enforcement.

@Erik_Eckstein, do we typically publish benchmark comparisons across releases?

Would this be clearer:

For any ASCII portion of a string’s content, UTF-8 uses 50% less memory than UTF-16. For any portion comprised of latter-BMP scalars, UTF-8 uses 50% more memory than UTF-16.

(Jon Shier) #8

Less accurate as microbenchmarks for development purposes, maybe, but they would give readers a much better idea of the performance to expect from one release to another. Swift should be doing such general benchmarks anyway to ensure there are no performance regressions between major releases.

1 Like
#9

If by “clearer” you mean “factually correct” then yes, yes it would. :-)

(Though I still hold that “comprised of” should be either “comprising” or “composed of” there. The difference is that “The whole comprises the parts” whereas “The parts compose the whole”.)

2 Likes
#10

I’m bumping this thread because the objectively false statement still appears in the blog post, and @Michael_Ilseman’s correction has not been incorporated.

Is there a better process than posting here, for getting fixes pushed to swift.org?

(Jeremy David Giesbrecht) #11

From the Oxford Dictionary of English:

comprise | kəmˈprʌɪz |

verb [with object]
consist of; be made up of: the country comprises twenty states.
• make up or constitute (a whole): this single breed comprises 50 per cent of the Swiss cattle population | (be comprised of): documents are comprised of words.

USAGE
Comprise primarily means ‘consist of’, as in “the country comprises twenty states”. It can also mean ‘constitute or make up a whole’, as in “this single breed comprises 50 per cent of the Swiss cattle population”. When this sense is used in the passive (as in “the country is comprised of twenty states”), it is more or less synonymous with the first sense (“the country comprises twenty states”). This usage is part of standard English, but the construction “comprise of”, as in “the property comprises of bedroom, bathroom, and kitchen”, is regarded as incorrect.

(Ted Kremenek) #12

@Michael_Ilseman and I somehow missed following up on this. A change has been pushed to the blog post; it should show up shortly.

1 Like