As I've been going through SBS doing clean-up work, I have noticed that our string benchmarks are in some places trying to cover a wider range of scripts, but the selection of test strings is rather ad hoc. Basically, we test text written in Latin, Cyrillic and CJK scripts. That is all.
The `StringWalk` workload variants are: `ascii`, `utf16`, `tweet`, `japanese`, `chinese`, `korean`, `russian`, `punctuated`, `punctuatedJapanese`.
But it looks like we are relying on them to fine-tune String's ABI and UTF-8 … I'm not so sure we are backing such a crucial decision with enough data.
I propose we start migrating existing benchmarks and write all new string performance tests against a single text corpus that more systematically covers the various scripts represented in Unicode. I think the ideal document for this purpose is the Universal Declaration of Human Rights. A plethora of official translations is available from the Unicode Consortium, which lets us create a text corpus of semantically equivalent information in various scripts and languages. I believe that would enable us to make much more sensible relative performance comparisons between scripts and unburden benchmark authors from the need to reinvent the wheel.
We can start small, with the two sentences of Article 1. If this proves useful, we can expand it in the future to include more (or all) articles, possibly even the preamble. I'm thinking this could start as a simple string and grow over time to include more articles; we could store the list of ranges for individual articles, so that smaller substrings can be extracted, as sketched below.
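To make the idea concrete, here is a minimal sketch of what a corpus entry could look like; the type and property names are mine, nothing like this exists in SBS today:

```swift
// Hypothetical shape of a corpus entry; names are illustrative only.
struct UDHRSample {
    let language: String                       // e.g. "eng", "cmn", "hin"
    let text: String                           // Article 1 for now, more articles later
    let articleRanges: [Range<String.Index>]   // ranges of the individual articles

    /// Extract a single article as a Substring, without copying the storage.
    func article(_ index: Int) -> Substring {
        return text[articleRanges[index]]
    }
}
```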
On top of this we could build parsing tests, by having language-specific parsing rules (strings to split the text into articles, paragraphs etc.), or string interpolation benchmarks (e.g. HTML formatting: filling templates with articles and paragraphs); a rough sketch follows.
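Roughly, those two benchmark shapes could look something like this (the helper names and the newline-based splitting are placeholders for real language-specific rules):

```swift
// Placeholder parsing rule: split a sample into paragraphs.
func splitIntoParagraphs(_ sample: String) -> [Substring] {
    return sample.split(separator: "\n")
}

// Interpolation benchmark: fill a trivial HTML template with the paragraphs.
func renderHTML(_ sample: String) -> String {
    var html = "<article>\n"
    for paragraph in splitIntoParagraphs(sample) {
        html += "  <p>\(paragraph)</p>\n"   // exercises interpolation and appending
    }
    html += "</article>"
    return html
}
```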
Scripts and Languages
I think all questions around Unicode are potentially politically sensitive, and we should be making decisions about strings very carefully and with all interested parties in mind. I hope you have already done all this internally at Apple and can enlighten me in my ignorance. I wonder what the impact of switching from UTF-16 to UTF-8 is for languages whose scripts didn't draw lucky cards and are not located at the beginning of the Basic Multilingual Plane. How are they impacted by the switch to a variable-length encoding that strongly favors ASCII and Latin scripts?
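To make the asymmetry concrete, here is a quick illustration I put together (not taken from any existing benchmark) of per-scalar encoded sizes in both encodings:

```swift
// Per-scalar encoded size for a few representative code points.
let samples: [(String, Character)] = [
    ("Latin A (U+0041)", "A"),
    ("Devanagari KA (U+0915)", "क"),
    ("CJK 人 (U+4EBA)", "人"),
    ("Emoji (U+1F600)", "😀"),
]

for (name, char) in samples {
    let s = String(char)
    // Latin fits in 1 UTF-8 byte, Devanagari and CJK need 3, the emoji needs 4;
    // UTF-16 uses 2 bytes for all of these except the emoji's surrogate pair (4).
    print("\(name): UTF-8 \(s.utf8.count) B, UTF-16 \(s.utf16.count * 2) B")
}
```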
The UDHR in Unicode contains, in addition to the super useful table of Translations that lists the scripts and language codes used, a page with Aggregates. I think our testing goal would be pretty well covered by picking the most used scripts from the page that has the first article in all the scripts. We just need to estimate the number of speakers that use a given script; we can probably skip the most esoteric ones.
I thought it might be useful to start from the List of languages by total number of speakers on Wikipedia, which draws on data from Ethnologue. Before trimming it down, I've compiled a bigger set of 29 samples.
For fun, and to maintain coverage parity with the existing `StringWalk`, I have even created an expressionistic translation of Article 1 into Emoji + math symbols.
(Shoutout to @codafi for indulging my procrastinations about emoji and math notation.)
See the gist of the corpus prototype.
What's the performance impact of switching from UTF-16 to UTF-8 on string processing algorithms for languages written in various scripts? Hard to say until we do proper benchmarking. I leave here the number of elements in the `.characters`, `.unicodeScalars`, `.utf8` and `.utf16` views. Given that the amount of information in the text is the same, the UTF-8 count is the number of bytes this information gets encoded into, as a combination of the language, the script's Unicode mapping and the UTF-8 encoding.
Element Counts and Encoded Size
lang | chars | scalars | utf16 | utf8 | UTF-16 (B) | UTF-8 (B) | 𝚫 (B) | 𝚫 % |
---|---|---|---|---|---|---|---|---|
eng | 170 | 170 | 170 | 170 | 340 | 170 | -170 | -50 % |
cmn | 43 | 43 | 43 | 125 | 86 | 125 | 39 | 45 % |
hin | 130 | 189 | 189 | 499 | 378 | 499 | 121 | 32 % |
spa | 171 | 171 | 171 | 173 | 342 | 173 | -169 | -49 % |
arb | 112 | 116 | 116 | 212 | 232 | 212 | -20 | -9 % |
fra | 186 | 186 | 186 | 191 | 372 | 191 | -181 | -49 % |
zlm | 200 | 200 | 200 | 200 | 400 | 200 | -200 | -50 % |
rus | 160 | 160 | 160 | 293 | 320 | 293 | -27 | -8 % |
ben | 118 | 168 | 168 | 452 | 336 | 452 | 116 | 35 % |
por | 174 | 174 | 174 | 180 | 348 | 180 | -168 | -48 % |
urd | 161 | 161 | 161 | 289 | 322 | 289 | -33 | -10 % |
deu | 164 | 164 | 164 | 166 | 328 | 166 | -162 | -49 % |
jpn | 85 | 85 | 85 | 255 | 170 | 255 | 85 | 50 % |
pnb | 194 | 194 | 194 | 344 | 388 | 344 | -44 | -11 % |
pan | 155 | 224 | 224 | 578 | 448 | 578 | 130 | 29 % |
pes | 146 | 146 | 146 | 261 | 292 | 261 | -31 | -11 % |
swh | 120 | 120 | 120 | 120 | 240 | 120 | -120 | -50 % |
jav | 89 | 151 | 151 | 453 | 302 | 453 | 151 | 50 % |
tel | 100 | 154 | 154 | 432 | 308 | 432 | 124 | 40 % |
tur | 160 | 160 | 160 | 171 | 320 | 171 | -149 | -47 % |
kor | 87 | 87 | 87 | 219 | 174 | 219 | 45 | 26 % |
mar | 127 | 189 | 189 | 509 | 378 | 509 | 131 | 35 % |
tam | 157 | 238 | 238 | 664 | 476 | 664 | 188 | 39 % |
vie | 183 | 215 | 215 | 279 | 430 | 279 | -151 | -35 % |
vieh | 45 | 45 | 52 | 142 | 104 | 142 | 38 | 37 % |
ita | 178 | 178 | 178 | 179 | 356 | 179 | -177 | -50 % |
hau | 222 | 222 | 222 | 229 | 444 | 229 | -215 | -48 % |
tha | 115 | 144 | 144 | 424 | 288 | 424 | 136 | 47 % |
emo | 55 | 56 | 76 | 130 | 152 | 130 | -22 | -14 % |
Does this approach make sense, or are we already covered elsewhere?