More Representative String Benchmarks

Michael_Ilseman · December 14, 2018, 6:29pm

+1 to everything @scanon said, here's some additional background.

Regarding the UDHR, it is a valuable source for exploring what is effectively the same prose in the largest number of languages we can represent electronically. I have used it in several investigations into the coverage of grapheme-breaking fast-paths, comparison and normalization fast-paths, etc. I also used GDP by language, translations of TSPL and Wikipedia pages, etc. Each of these is individually a poor proxy for the electronic representation of text, especially in a performance-sensitive context. But combining them with an understanding of their inherent limitations has been valuable.

However, they are a counter-productive data set for benchmarking, for the reasons @scanon points out.

The main data set that we're under-benchmarking is multi-byte text with long runs of ASCII scattered throughout. This models much (most?) actual electronic text, as there's usually markup, etc., present.
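To make that concrete, here is a rough sketch of what such a workload could look like. The function name, the markup, and the Japanese filler sentence are all made up for illustration; this is not taken from the existing benchmark suite.

```swift
// Hypothetical generator: interleaves runs of ASCII markup with multi-byte prose.
func makeMixedText(paragraphs: Int) -> String {
    var result = ""
    for index in 0..<paragraphs {
        result += "<p class=\"body\" id=\"p\(index)\">"  // ASCII run
        result += "これは日本語の文章です。"                  // 3-byte scalars in UTF-8
        result += "</p>\n"                               // ASCII run
    }
    return result
}

let text = makeMixedText(paragraphs: 3)
// Bytes below 0x80 are ASCII; everything else belongs to a multi-byte scalar.
let asciiByteCount = text.utf8.filter { $0 < 0x80 }.count
print("ASCII bytes: \(asciiByteCount) of \(text.utf8.count)")
```

A real workload would presumably vary the ratio and the length of the ASCII runs rather than hard-coding one pattern.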

The main operation we're missing coverage of is hunting for small islands of ASCII in a sea of otherwise opaque data. This is the most common performance-sensitive way to consume a String. Now that String.UTF8View can provide direct access to the bytes for non-lazily-bridged strings, this is more feasible. Character Literals will make it much more ergonomic.
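As a minimal sketch of that kind of operation (the helper name and the sample string are hypothetical), the scan below walks the UTF-8 view looking for a single ASCII byte. It relies on the fact that in valid UTF-8 a byte below 0x80 never occurs inside a multi-byte sequence, so the non-ASCII content can be treated as opaque.

```swift
// Hypothetical helper: counts occurrences of an ASCII byte by walking the UTF-8 view.
func countOccurrences(of asciiByte: UInt8, in text: String) -> Int {
    precondition(asciiByte < 0x80, "expected an ASCII byte")
    var count = 0
    // No decoding needed: ASCII bytes never appear inside a multi-byte scalar.
    for byte in text.utf8 where byte == asciiByte {
        count += 1
    }
    return count
}

let sample = "名前<b>太郎</b>と<i>花子</i>"
let openAngle = UInt8(ascii: "<")
print(countOccurrences(of: openAngle, in: sample))  // prints 4
```

The `UInt8(ascii:)` spelling is the part that character literals would make less awkward.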

This would be really useful and also good for testing. We need coverage of different "shapes" of a string's machine representation more than its linguistic properties. Shape includes its mixture of scalars including byte-width and normality, small vs large strings, long runs of shared prefix/suffix (for comparison early exits) with varying normality, native string vs lazily-bridged-contiguous-UTF-16 vs lazily-bridged-contiguous-ASCII vs lazily-bridged-discontiguous-UTF-16, etc.
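A few of those shapes can be sketched with native strings alone. The names and sizes below are illustrative assumptions, and the lazily-bridged variants are left out because they need Foundation and an NSString-backed source.

```swift
// Hypothetical workload "shapes"; names and sizes are illustrative only.
let asciiSmall = "abc"
let asciiLarge = String(repeating: "abcdefghij", count: 1_000)
let multiByteLarge = String(repeating: "こんにちは", count: 1_000)

// The same text in precomposed (NFC) and decomposed (NFD) form.
let precomposed = "caf\u{E9}"    // é as a single scalar
let decomposed = "cafe\u{301}"   // e followed by a combining acute accent

// A long shared prefix with a difference only at the very end,
// so comparison cannot exit early.
let sharedPrefix = String(repeating: "x", count: 10_000)
let left = sharedPrefix + "a"
let right = sharedPrefix + "b"

print(asciiSmall.utf8.count, asciiLarge.utf8.count, multiByteLarge.utf8.count)
print(precomposed == decomposed)  // String equality uses canonical equivalence
print(precomposed.unicodeScalars.count, decomposed.unicodeScalars.count)
print(left < right)
```

Pairing the shared-prefix case with mixed normality (one side NFC, the other NFD) would be the obvious next variation.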

Unfortunately, developing this keeps getting deferred due to ABI-critical fixes and changes.