More Representative String Benchmarks

As I've been going through SBS doing clean-up work, I have noticed that our string benchmarks in some places try to cover a wider range of scripts, but the selection of test strings is rather ad hoc. Basically, we test text written in the Latin, Cyrillic and CJK scripts. That is all.

StringWalk workload variants

ascii, utf16, tweet, japanese, chinese, korean, russian, punctuated, punctuatedJapanese

But it looks like we are relying on them to fine-tune String's ABI and UTF-8 … I'm not so sure we are backing such a crucial decision with enough data.

I propose we start migrating existing benchmarks and write all new string performance tests against a single text corpus that more systematically covers the various scripts represented in Unicode. I think the ideal document for this purpose is the Universal Declaration of Human Rights. A plethora of official translations is available from the Unicode Consortium, which we could use to create a text corpus of semantically equivalent information in various scripts and languages. I believe that would enable us to make much more sensible relative performance comparisons between various scripts, and unburden benchmark authors from the need to reinvent the wheel.

We can start small, with the two sentences from Article 1. If this proves useful, we can expand it in the future to include more (or all) articles, possibly even the preamble. I'm thinking each sample could start as a simple string and grow over time to include more articles, and we could store the list of ranges for the individual articles, for extracting smaller substrings.
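To make that concrete, here is a minimal sketch of what such a sample type could look like (the type and member names are made up for illustration; nothing like this exists in SBS yet):

```swift
/// Purely illustrative sketch of a corpus sample; not an existing SBS type.
struct UDHRSample {
    /// Language code, e.g. "eng" or "cmn".
    let lang: String
    /// The translated articles, concatenated into a single string.
    let text: String
    /// Ranges of the individual articles within `text`, so that
    /// benchmarks can extract smaller substrings.
    let articleRanges: [Range<String.Index>]

    /// The n-th article (0-based) as a cheap, non-copying substring.
    func article(_ n: Int) -> Substring {
        text[articleRanges[n]]
    }
}
```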

On top of this we could build parsing tests, using language-specific parsing rules (strings that split the text into articles, paragraphs etc.), or string interpolation benchmarks (e.g. HTML formatting: filling templates with articles and paragraphs).
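For example, an interpolation benchmark body could look roughly like this (again purely illustrative, building on the hypothetical UDHRSample sketched above):

```swift
/// Wraps every article of a sample in an HTML list, exercising
/// string interpolation and concatenation.
func renderHTML(_ sample: UDHRSample) -> String {
    var html = "<ol>\n"
    for range in sample.articleRanges {
        html += "  <li><p>\(sample.text[range])</p></li>\n"
    }
    return html + "</ol>\n"
}
```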

Scripts and Languages

I think all questions around Unicode are potentially politically sensitive, and we should be making decisions about strings very carefully and with all interested parties in mind. I hope you have already done all this internally at Apple and can enlighten me in my ignorance. I wonder what the impact of switching from UTF-16 to UTF-8 is for languages whose scripts didn't draw lucky cards and are not located at the beginning of the Basic Multilingual Plane. How are they impacted by the switch to a variable-length encoding that strongly favors ASCII and Latin scripts?

The UDHR in Unicode contains, in addition to the super useful table of Translations that lists the scripts used and the language codes, a page with Aggregates. I think our testing goal would be pretty well covered by taking the first article in the most used scripts. We just need to estimate the number of speakers whose languages are written in a given script; we can probably skip the most esoteric ones.

I thought it might be useful to start from the List of languages by total number of speakers on Wikipedia, which draws on data from Ethnologue. Before trimming the list down, I compiled a bigger set of 29 samples.

For fun, and to maintain coverage parity with the existing StringWalk, I have even created this expressionistic translation of Article 1 into emoji + math symbols:

:one:

:bust_in_silhouette::busts_in_silhouette:,:pregnant_woman::arrow_right::baby:: :bust_in_silhouette::statue_of_liberty:
:bust_in_silhouette::bust_in_silhouette:: {:handshake:,:scroll:}
:busts_in_silhouette: = {:bust_in_silhouette: | :brain:, :angel::thinking::imp:}
:bust_in_silhouette::bust_in_silhouette:: :two_men_holding_hands::ghost:

Article 1

All human beings are born free and
equal in dignity and rights.
They are endowed with reason and conscience and
should act towards one another in a spirit of brotherhood.

(Shoutout to @codafi for indulging my procrastinations about emoji and math notation.)

See the gist of the corpus prototype.

What's the performance impact of switching from UTF-16 to UTF-8 on string processing algorithms, for languages written in various scripts? Hard to say until we do proper benchmarking. For now, here are the numbers of elements in the .characters, .unicodeScalars, .utf8 and .utf16 views. Given that the amount of information in the text is the same across samples, the utf8 count is the number of bytes this information gets represented in, as a combined result of the language, the script's Unicode mapping and the UTF-8 encoding.
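For reference, these counts can be gathered with the standard view APIs (in current Swift the character count is spelled `count` rather than `.characters.count`):

```swift
let sample = "All human beings are born free and equal in dignity and rights."
let counts = (
    chars:   sample.count,                // Characters (grapheme clusters)
    scalars: sample.unicodeScalars.count, // Unicode scalars
    utf16:   sample.utf16.count,          // UTF-16 code units (bytes = 2 × count)
    utf8:    sample.utf8.count            // UTF-8 code units == bytes
)
```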

Element Counts and Encoded Size
lang  chars  scalars  utf16  utf8  UTF-16 (B)  UTF-8 (B)  Δ (B)    Δ%
eng     170      170    170   170         340        170   -170  -50 %
cmn      43       43     43   125          86        125     39   45 %
hin     130      189    189   499         378        499    121   32 %
spa     171      171    171   173         342        173   -169  -49 %
arb     112      116    116   212         232        212    -20   -9 %
fra     186      186    186   191         372        191   -181  -49 %
zlm     200      200    200   200         400        200   -200  -50 %
rus     160      160    160   293         320        293    -27   -8 %
ben     118      168    168   452         336        452    116   35 %
por     174      174    174   180         348        180   -168  -48 %
urd     161      161    161   289         322        289    -33  -10 %
deu     164      164    164   166         328        166   -162  -49 %
jpn      85       85     85   255         170        255     85   50 %
pnb     194      194    194   344         388        344    -44  -11 %
pan     155      224    224   578         448        578    130   29 %
pes     146      146    146   261         292        261    -31  -11 %
swh     120      120    120   120         240        120   -120  -50 %
jav      89      151    151   453         302        453    151   50 %
tel     100      154    154   432         308        432    124   40 %
tur     160      160    160   171         320        171   -149  -47 %
kor      87       87     87   219         174        219     45   26 %
mar     127      189    189   509         378        509    131   35 %
tam     157      238    238   664         476        664    188   39 %
vie     183      215    215   279         430        279   -151  -35 %
vieh     45       45     52   142         104        142     38   37 %
ita     178      178    178   179         356        179   -177  -50 %
hau     222      222    222   229         444        229   -215  -48 %
tha     115      144    144   424         288        424    136   47 %
emo      55       56     76   130         152        130    -22  -14 %

Does this approach make sense, or are we already covered elsewhere?

2 Likes

You're dancing around a very good direction here, but I want to try to redirect you ever so slightly. First, strictly from the standpoint of evaluating the performance impact of changes to the representation of strings in Swift, really the only consideration for a test corpus is having an appropriate sample of distributions of 1-, 2-, 3- and 4-byte scalars. From this perspective, the 29 samples you suggest here are basically only three distinct tests. You can see this very clearly if you plot the results.

I would note that each of these three clusters is pretty well represented by the samples that Michael has been using for his work, and that he has a few other string "shapes" covered as well, which are really important for real-world work. Which brings me to my next point.

A paragraph of prose written entirely in the characters of a single language, in a formal style, is the exception, not the rule. I love the sentiment of using the UDHR, but it's not representative of the way people use language, and human use of written language is not representative of text in general. Real text has formatting, markup or other metadata, mixes multiple languages, and is interspersed with emoji. It's messy.

Which isn't to say that we shouldn't be building and improving the corpus of test cases, and including more languages is a vital part of that. But for the work that Michael and Karoy have been doing so far, having a representative mix of character distributions has been much more critical. Building a representative corpus is a long-term project, probably outside the scope of Swift, and mostly outside our expertise (my one undergraduate course in corpus linguistics only carries me so far). It's a large undertaking, and I would suggest delegating to linguistic experts and using existing corpora rather than trying to create our own.

A more immediately useful direction for benchmarking would be to set up a string generator that can produce synthetic string samples with a specified encoding distribution and length. That would get us a good representative sample, not of human language, but of the performance categories of encoding distributions, and it would also allow us to do things like searching for performance worst cases.
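To sketch the idea (all names here are made up, and weighting the UTF-8 byte widths is just one plausible knob):

```swift
/// Picks a random scalar whose UTF-8 encoding has the given byte width,
/// retrying past the surrogate gap in the 3-byte range.
func randomScalar<R: RandomNumberGenerator>(
    utf8Width: Int, using rng: inout R
) -> Unicode.Scalar {
    let range: ClosedRange<UInt32>
    switch utf8Width {
    case 1:  range = 0x20 ... 0x7E            // printable ASCII
    case 2:  range = 0x80 ... 0x7FF
    case 3:  range = 0x800 ... 0xFFFF         // contains surrogates; see below
    default: range = 0x1_0000 ... 0x10_FFFF
    }
    while true {
        // Unicode.Scalar.init?(UInt32) returns nil for surrogate code points.
        if let s = Unicode.Scalar(UInt32.random(in: range, using: &rng)) {
            return s
        }
    }
}

/// Produces `count` scalars with the given relative weights for
/// 1-, 2-, 3- and 4-byte scalars.
func syntheticString<R: RandomNumberGenerator>(
    scalars count: Int, widthWeights w: [Double], using rng: inout R
) -> String {
    var result = ""
    result.unicodeScalars.reserveCapacity(count)
    for _ in 0 ..< count {
        // Weighted choice of a UTF-8 byte width in 1...w.count.
        var r = Double.random(in: 0 ..< w.reduce(0, +), using: &rng)
        var width = w.count
        for (i, weight) in w.enumerated() {
            if r < weight { width = i + 1; break }
            r -= weight
        }
        result.unicodeScalars.append(randomScalar(utf8Width: width, using: &rng))
    }
    return result
}

// Example: 60 % ASCII, 10 % two-byte, 25 % three-byte, 5 % four-byte scalars.
var rng = SystemRandomNumberGenerator()
let sample = syntheticString(scalars: 1_000,
                             widthWeights: [0.60, 0.10, 0.25, 0.05],
                             using: &rng)
```

Searching for worst cases would then amount to sweeping the weight vector, and substituting a seeded generator for SystemRandomNumberGenerator would keep the samples reproducible.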

11 Likes

+1 to everything @scanon said; here's some additional background.

Regarding the UDHR: it is a valuable source for exploring what is effectively the same prose in the broadest number of languages we can represent electronically. I have used it in several investigations into the coverage of grapheme-breaking fast paths, comparison and normalization fast paths, etc. I have also used GDP BY LANGUAGE, translations of TSPL, Wikipedia pages, etc. Each of these individually is a poor proxy for the electronic representation of text, especially in a performance-sensitive context, but combining them with an understanding of their inherent limitations has been valuable.

However, they are a counter-productive data set for benchmarking, for the reasons @scanon points out.

The main data set that we're under-benchmarking is multi-byte text with long runs of ASCII scattered throughout. This models much (most?) actual electronic text, as there's usually markup etc. present.

The main operation we're missing coverage for is hunting for small islands of ASCII in a sea of otherwise opaque data. This is the most common performance-sensitive way to consume a String. Now that String.UTF8View can provide direct access to the bytes of non-lazily-bridged strings, this is more feasible. Character Literals will make it much more ergonomic.
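To illustrate the pattern (just a sketch; only standard String.UTF8View API is used, and the tag-scanning task is an arbitrary stand-in):

```swift
/// Collects the ranges of "<...>" tags, hunting for the ASCII bytes
/// '<' and '>' and treating everything in between as opaque.
func tagRanges(in s: String) -> [Range<String.Index>] {
    var ranges: [Range<String.Index>] = []
    var cursor = s.utf8.startIndex
    while let open = s.utf8[cursor...].firstIndex(of: UInt8(ascii: "<")),
          let close = s.utf8[open...].firstIndex(of: UInt8(ascii: ">")) {
        ranges.append(open ..< s.utf8.index(after: close))
        cursor = s.utf8.index(after: close)
    }
    return ranges
}

// tagRanges(in: "<p>Ĉiuj homoj estas <b>denaske</b> liberaj</p>")
// finds the ranges of "<p>", "<b>", "</b>" and "</p>".
```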

Such a generator would be really useful, and also good for testing. We need coverage of the different "shapes" of a string's machine representation more than of its linguistic properties. Shape includes the string's mixture of scalars (byte width and normality), small vs. large strings, long runs of shared prefix/suffix (for comparison early exits) with varying normality, and native strings vs. lazily-bridged-contiguous-UTF-16 vs. lazily-bridged-contiguous-ASCII vs. lazily-bridged-discontiguous-UTF-16, etc.
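If it helps, those shape dimensions could be captured in a benchmark-side configuration type along these lines (entirely hypothetical; the storage cases just name the variants listed above):

```swift
/// Hypothetical description of a string "shape" for benchmark generation;
/// not an existing stdlib or SBS type.
struct StringShape {
    enum Storage {
        case native
        case bridgedContiguousASCII
        case bridgedContiguousUTF16
        case bridgedDiscontiguousUTF16
    }
    var widthWeights: [Double]   // mixture of 1- to 4-byte scalars
    var isNormalized: Bool       // NFC vs. denormalized content
    var isSmall: Bool            // small-string form vs. heap allocation
    var sharedPrefixLength: Int  // for comparison early-exit tests
    var storage: Storage
}
```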

Unfortunately, developing this keeps getting deferred due to ABI-critical fixes and changes.

7 Likes

The impression I got from your descriptions of the desired generator is that average character frequencies (by script membership or by encoded scalar length) are less important, and that the priority would be to produce realistically chunky text, e.g. an ASCII email header followed by HTML-formatted text in different scripts.
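If that's right, the configuration could be as simple as a sequence of runs, each with its own width mixture. A guess at the shape of such an API, reusing the hypothetical syntheticString sketch from upthread:

```swift
/// A run of text with its own UTF-8 byte-width mixture.
struct Chunk {
    var scalarCount: Int
    var widthWeights: [Double]
}

/// Concatenates per-chunk synthetic text into one "chunky" sample.
func chunkyString<R: RandomNumberGenerator>(
    _ chunks: [Chunk], using rng: inout R
) -> String {
    var result = ""
    for chunk in chunks {
        result += syntheticString(scalars: chunk.scalarCount,
                                  widthWeights: chunk.widthWeights,
                                  using: &rng)
    }
    return result
}

// E.g. an ASCII header followed by a mostly three-byte (CJK-like) body:
// chunkyString([Chunk(scalarCount: 200, widthWeights: [1, 0, 0, 0]),
//               Chunk(scalarCount: 800, widthWeights: [0.1, 0, 0.9, 0])],
//              using: &rng)
```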

Are there any examples of text generators, preferably online, that model what you're after here? Especially when it gets to the API or configuration interface of the generator. I'm guessing WeirdTextGenerator is not it… :stuck_out_tongue_winking_eye::crazy_face: I'll see what I can do about this in the new year. (My current priority is finishing the SBS cleanup pass.)

1 Like