Benchmark Categories

Hi swift-dev,

Joe Shajrawi is adding a categorization framework to our in-tree benchmark
suite (swift/benchmark). We're going to have an initial set of tags as defined below
(comments welcome). These are free-form tags, but they will fall into
a natural hierarchy. The purpose of tagging a benchmark is to:

- Help document a benchmark's intent and search for relevant
  benchmarks when changing the stdlib/runtime/optimizer.

- Quickly run a subset of benchmarks most relevant to a particular
  stdlib/runtime/compiler change.

- Document performance coverage. Any API, runtime call, or pattern
  considered important for general Swift performance should be
  explicitly represented among the "validation" suite.

- Track the performance of different kinds of benchmarks
  independently. For example, "regression" benchmarks are only useful
  for identifying performance regressions. They may not be highly
  applicable to general Swift performance. "validation" benchmarks are
  areas that we want to continually improve. A regression on a
  validation benchmark is potentially more serious than on a
  regression benchmark.

Note that we don't have "unit test" benchmarks. Specific compiler
transformations should be verified with lit tests. "Regression"
benchmarks are usually just a bit too complicated to rely solely on a
lit test.

--- Tags ---

#validation : These are "micro" benchmarks that test a specific
operation or critical path that we know is important to measure. (I
considered calling these #coverage, but don't want to confuse them
with code coverage efforts).

Within #validation we have:

   #api -> #Array, #String, #Dictionary, #Codable, etc.
   #runtime -> #refcount, #metadata, etc.

   #stable : additionally tag any validation tests that already have a
   stable, reasonably optimized implementation.

#algorithm : These are "micro" benchmarks that test some well-known
algorithm in isolation: sorting, searching, hashing, Fibonacci,
crypto, etc.
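For a concrete sense of what an #algorithm benchmark looks like, here
is a minimal sketch (the `fib`/`run_Fibonacci` names and the
run-function shape are illustrative, not part of the actual suite):

```swift
// Iterative Fibonacci, the kind of well-known algorithm an
// #algorithm benchmark exercises in isolation.
func fib(_ n: Int) -> Int {
  var (a, b) = (0, 1)
  for _ in 0..<n { (a, b) = (b, a + b) }
  return a
}

// Hypothetical benchmark entry point: repeat the work n times so the
// harness can scale the iteration count.
func run_Fibonacci(_ n: Int) {
  for _ in 0..<n {
    // precondition keeps the result live so the optimizer cannot
    // remove the loop entirely.
    precondition(fib(30) == 832_040)
  }
}
```

The point is that the measured code is a single algorithm with no API
or data-structure mixing, which is what separates #algorithm from
#miniapplication.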

#miniapplication : These benchmarks are contrived to mimic some subset
of application behavior in a way that can be easily measured. They are
larger than micro-benchmarks, combining multiple APIs, data
structures, or algorithms. This includes small standardized
benchmarks, pieces of real applications that have been extracted into
a benchmark, important functionality like JSON parsing, etc.

#regression : Pretty much everything else. This could be a random
piece of code that was attached to a bug report. We want to make sure
the optimizer as a whole continues to handle this case, but don't know
how applicable it is to general Swift performance relative to the
other micro-benchmarks. In particular, these aren't weighted as highly
as "validation" benchmarks and likely won't be the subject of future
investigation unless they significantly regress.
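To make the hierarchy concrete, here is one way a tagging API along
these lines might look. This is purely a hypothetical sketch: the
`BenchmarkCategory` cases and the `BenchmarkInfo` shape below are
illustrative, not the actual framework Joe is adding:

```swift
// Hypothetical tag enum mirroring the hierarchy above. Case names
// are illustrative; the real framework may spell them differently.
enum BenchmarkCategory: String {
  case validation, api, array, string, dictionary, codable
  case runtime, refcount, metadata, stable
  case algorithm, miniapplication, regression
}

// A benchmark registers its name, entry point, and tags, so the
// harness can run subsets like "all #validation #api benchmarks".
struct BenchmarkInfo {
  let name: String
  let runFunction: (Int) -> Void
  let tags: [BenchmarkCategory]
}

// Example: a validation benchmark covering the Array API that
// already has a stable, optimized implementation.
let arrayAppend = BenchmarkInfo(
  name: "ArrayAppend",
  runFunction: { n in
    var a: [Int] = []
    for i in 0..<(n * 1_000) { a.append(i) }
    precondition(a.count == n * 1_000)
  },
  tags: [.validation, .api, .array, .stable])
```

With something like this, running only the benchmarks relevant to a
stdlib Array change is a matter of filtering on tags.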