As I've been going through SBS doing clean-up work, I have noticed that our string benchmarks are in some places trying to cover a wider range of scripts, but the selection of test strings is rather ad hoc. Basically, we test text written in Latin, Cyrillic and CJK scripts. That is all.
The `StringWalk` workload variants are: `ascii`, `utf16`, `tweet`, `japanese`, `chinese`, `korean`, `russian`, `punctuated`, `punctuatedJapanese`.
But it looks like we are relying on them to fine-tune String's ABI and UTF-8 … I'm not so sure we are backing such a crucial decision with enough data.
I propose we start migrating existing benchmarks and write all new string performance tests against a single text corpus that more systematically covers the various scripts represented in Unicode. I think the ideal document for this purpose is the Universal Declaration of Human Rights. A plethora of official translations is available from the Unicode Consortium, which lets us create a text corpus of semantically equivalent information in various scripts and languages. I believe that would enable us to make much more sensible relative performance comparisons between scripts and unburden benchmark authors from the need to reinvent the wheel.
We can start small, with the two sentences of Article 1. If this proves useful, we can expand it in the future to include more (or all) articles, possibly even the preamble. I'm thinking this could start as a simple string and grow over time to include more articles; we could store the list of ranges for individual articles, so that smaller substrings can be extracted, as sketched below.
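To make the idea concrete, here is a minimal sketch of what a corpus entry could look like; the type and property names are mine, nothing like this exists in SBS today:

```swift
// Hypothetical shape of a corpus entry; names are illustrative only.
struct UDHRSample {
    let language: String                       // e.g. "eng", "cmn", "hin"
    let text: String                           // Article 1 for now, more articles later
    let articleRanges: [Range<String.Index>]   // ranges of the individual articles

    /// Extract a single article as a Substring, without copying the storage.
    func article(_ index: Int) -> Substring {
        return text[articleRanges[index]]
    }
}
```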
On top of this we could build parsing tests, by having language-specific parsing rules (strings to split the text into articles, paragraphs etc.), or string interpolation benchmarks (e.g. HTML formatting: filling templates with articles and paragraphs); a rough sketch follows.
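Roughly, those two benchmark shapes could look something like this (the helper names and the newline-based splitting are placeholders for real language-specific rules):

```swift
// Placeholder parsing rule: split a sample into paragraphs.
func splitIntoParagraphs(_ sample: String) -> [Substring] {
    return sample.split(separator: "\n")
}

// Interpolation benchmark: fill a trivial HTML template with the paragraphs.
func renderHTML(_ sample: String) -> String {
    var html = "<article>\n"
    for paragraph in splitIntoParagraphs(sample) {
        html += "  <p>\(paragraph)</p>\n"   // exercises interpolation and appending
    }
    html += "</article>"
    return html
}
```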
Scripts and Languages
I think all questions around Unicode are potentially politically sensitive, and we should be making decisions about strings very carefully and with all interested parties in mind. I hope you have already done all this internally at Apple and can enlighten me in my ignorance. I wonder what the impact of switching from UTF-16 to UTF-8 is for languages whose scripts didn't draw lucky cards and are not located at the beginning of the Basic Multilingual Plane. How are they impacted by the switch to a variable-length encoding that strongly favors ASCII and Latin scripts?
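To make the asymmetry concrete, here is a quick illustration I put together (not taken from any existing benchmark) of per-scalar encoded sizes in both encodings:

```swift
// Per-scalar encoded size for a few representative code points.
let samples: [(String, Character)] = [
    ("Latin A (U+0041)", "A"),
    ("Devanagari KA (U+0915)", "क"),
    ("CJK 人 (U+4EBA)", "人"),
    ("Emoji (U+1F600)", "😀"),
]

for (name, char) in samples {
    let s = String(char)
    // Latin fits in 1 UTF-8 byte, Devanagari and CJK need 3, the emoji needs 4;
    // UTF-16 uses 2 bytes for all of these except the emoji's surrogate pair (4).
    print("\(name): UTF-8 \(s.utf8.count) B, UTF-16 \(s.utf16.count * 2) B")
}
```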
The UDHR in Unicode contains, in addition to the super useful table of Translations that lists the scripts and language codes used, a page with Aggregates. I think our testing goal would be pretty well covered by picking the most used scripts from the page that has the first article in all the scripts. We just need to estimate the number of speakers that use a given script; we can probably skip the most esoteric ones.
I thought it might be useful to start from the List of languages by total number of speakers on Wikipedia, which draws on data from Ethnologue. Before trimming it down, I've compiled a bigger set of 29 samples.
For fun, and to maintain coverage parity with the existing `StringWalk`, I have even created an expressionistic translation of Article 1 into Emoji + math symbols.
(Shoutout to @codafi for indulging my procrastinations about emoji and math notation.)
See the gist of the corpus prototype.
What's the performance impact of switching from UTF-16 to UTF-8 on string processing algorithms for languages written in various scripts? Hard to say until we do proper benchmarking. I leave here the number of elements in the `.characters`, `.unicodeScalars`, `.utf8` and `.utf16` views. Given that the amount of information in the text is the same, the UTF-8 count is the number of bytes this information gets encoded into, as a combination of the language, the script's Unicode mapping and the UTF-8 encoding.
Element Counts and Encoded Size
lang | chars | scalars | utf16 | utf8 | UTF-16 (B) | UTF-8 (B) | 𝚫 (B) | 𝚫 % |
---|---|---|---|---|---|---|---|---|
eng | 170 | 170 | 170 | 170 | 340 | 170 | -170 | -50 % |
cmn | 43 | 43 | 43 | 125 | 86 | 125 | 39 | 45 % |
hin | 130 | 189 | 189 | 499 | 378 | 499 | 121 | 32 % |
spa | 171 | 171 | 171 | 173 | 342 | 173 | -169 | -49 % |
arb | 112 | 116 | 116 | 212 | 232 | 212 | -20 | -9 % |
fra | 186 | 186 | 186 | 191 | 372 | 191 | -181 | -49 % |
zlm | 200 | 200 | 200 | 200 | 400 | 200 | -200 | -50 % |
rus | 160 | 160 | 160 | 293 | 320 | 293 | -27 | -8 % |
ben | 118 | 168 | 168 | 452 | 336 | 452 | 116 | 35 % |
por | 174 | 174 | 174 | 180 | 348 | 180 | -168 | -48 % |
urd | 161 | 161 | 161 | 289 | 322 | 289 | -33 | -10 % |
deu | 164 | 164 | 164 | 166 | 328 | 166 | -162 | -49 % |
jpn | 85 | 85 | 85 | 255 | 170 | 255 | 85 | 50 % |
pnb | 194 | 194 | 194 | 344 | 388 | 344 | -44 | -11 % |
pan | 155 | 224 | 224 | 578 | 448 | 578 | 130 | 29 % |
pes | 146 | 146 | 146 | 261 | 292 | 261 | -31 | -11 % |
swh | 120 | 120 | 120 | 120 | 240 | 120 | -120 | -50 % |
jav | 89 | 151 | 151 | 453 | 302 | 453 | 151 | 50 % |
tel | 100 | 154 | 154 | 432 | 308 | 432 | 124 | 40 % |
tur | 160 | 160 | 160 | 171 | 320 | 171 | -149 | -47 % |
kor | 87 | 87 | 87 | 219 | 174 | 219 | 45 | 26 % |
mar | 127 | 189 | 189 | 509 | 378 | 509 | 131 | 35 % |
tam | 157 | 238 | 238 | 664 | 476 | 664 | 188 | 39 % |
vie | 183 | 215 | 215 | 279 | 430 | 279 | -151 | -35 % |
vieh | 45 | 45 | 52 | 142 | 104 | 142 | 38 | 37 % |
ita | 178 | 178 | 178 | 179 | 356 | 179 | -177 | -50 % |
hau | 222 | 222 | 222 | 229 | 444 | 229 | -215 | -48 % |
tha | 115 | 144 | 144 | 424 | 288 | 424 | 136 | 47 % |
emo | 55 | 56 | 76 | 130 | 152 | 130 | -22 | -14 % |
Does this approach make sense, or are we already covered elsewhere?