Request for comments: Formal grammar in "The Swift Programming Language" using DocC

Issue 3 for "The Swift Programming Language" tracks the need to adapt how we write formal grammar, to pick a notation that's compatible with authoring in DocC.

When we were picking a notation for the launch of Swift 1, the most important consideration was that the grammar should be easy to read, even for someone who hadn’t previously worked with formal grammar. This focus on readability resulted in the following:

  • A very limited set of ways to combine the pieces of the grammar. Other notations allow for more complex operations like grouping and repetition — those make writing the grammar easier and they make the resulting grammar more compact, but at the cost of readability.

  • Use of styling differences to mark a piece’s role in the grammar. Other notations use approaches like quoting, escaping, sigils, and character case for this purpose. That choice allows them to be written in plain text, which is necessary for things like RFCs and plain-text email discussion, and for grammars that are transformed into part of a computer program. However, again, it comes at the cost of readability.

  • Strong visual distinction between the pieces of the grammar. In most cases, there are two styling differences between parts with different meanings — for example, literals are in code voice and bold, nonterminals are non-bold and italic, and optionality is a subscript. That’s especially important for literals that are only one or two characters, which appear frequently in the grammar.

At that time, Swift wasn’t open source and Swift Evolution didn’t exist, so we didn’t focus on how easy the notation was to write. Experience since then has shown that subscript “opt” is particularly hard to write in some important contexts. It isn’t available in the markdown dialect used for Swift Evolution proposals, or in the markup used by the Swift forums, which makes talking about changes to the grammar more difficult.

Are these still the right goals for TSPL’s grammar? Does this proposed notation meet these goals?

The differences between the current grammar and what I’m proposing as a DocC-compatible notation are:

  • Marking optionality with a question mark (?) in italics. The current notation uses subscript “opt” in gray italics, but DocC doesn’t support subscripts. The question mark is slightly less clear than the subscript, especially after optional single-character literals. However, the style difference between bold code voice and italic non-bold non-code should make the postfix ? stand out from the literal. Likewise, it should distinguish a question mark that is a single-character literal from one that marks optionality. (Compare the screenshots below for optional-type and conditional-operator.)

  • Coloring all non-link text black. The current notation uses gray for the subscript “opt” and for the name of a syntactic category being defined. DocC doesn’t support changing the color of text.

  • Writing syntactic category names in italics. The current notation in HTML on Swift.org doesn’t italicize syntactic category names on the right-hand side of the arrow; the ePUB version does. The prose description of the grammar also says that syntactic category names appear in italics, so I think this was a tooling mistake that happened when migrating TSPL from developer.apple.com to Swift.org.

  • Omitting the links to a syntactic category’s definition, temporarily. In the current notation, everywhere a syntactic category appears, it’s a live link that takes you to its definition. That makes it easy to navigate through the grammar, especially in cases where you need to step through multiple productions. DocC supports links only to headings; issue 48 is tracking this for TSPL and I’m writing up an enhancement request for DocC.
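
To make the styling rules in this list concrete, here is a minimal Swift sketch that renders one production as a line of Markdown: bold code voice for literals, italics for syntactic categories, and an italic postfix ? for optionality. The `Symbol` type and `render` function are hypothetical names for illustration, and the exact Markdown spelling (for example `_?_`) is an assumption rather than the format TSPL's source actually uses.

```swift
// Hypothetical model of one symbol on the right-hand side of a production.
struct Symbol {
    enum Kind { case literal, category }
    var kind: Kind
    var text: String
    var isOptional = false
}

// Renders one production as a line of Markdown.
// The Markdown spelling is an assumption, not TSPL's actual source format.
func render(name: String, symbols: [Symbol]) -> String {
    let rhs = symbols.map { symbol -> String in
        // Literals: bold code voice. Syntactic categories: italics.
        let base = symbol.kind == .literal
            ? "**`\(symbol.text)`**"
            : "*\(symbol.text)*"
        // Optionality: an italic postfix question mark.
        return symbol.isOptional ? base + "_?_" : base
    }
    return "*\(name)* → " + rhs.joined(separator: " ")
}

// A question mark as a literal...
let optionalType = render(name: "optional-type", symbols: [
    Symbol(kind: .category, text: "type"),
    Symbol(kind: .literal, text: "?"),
])
// ...and a question mark that marks optionality.
let codeBlock = render(name: "code-block", symbols: [
    Symbol(kind: .literal, text: "{"),
    Symbol(kind: .category, text: "statements", isOptional: true),
    Symbol(kind: .literal, text: "}"),
])
print(optionalType)
print(codeBlock)
```

In the output, the literal question mark carries both bold and code voice, while the optionality marker carries only italics. That is the two-way styling difference the proposal relies on.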

The screenshots below illustrate this proposed approach, and the existing approach:




Screen Shot 2022-10-13 at 5.15.18 PM
Screen Shot 2022-10-13 at 5.17.30 PM


I find the grammar of Go programming language easier to read. It is also machine readable because it does not use bold or italic formatting to convey meta meaning.

Notation

The syntax is specified using a variant of Extended Backus-Naur Form (EBNF):

Syntax      = { Production } .
Production  = production_name "=" [ Expression ] "." .
Expression  = Term { "|" Term } .
Term        = Factor { Factor } .
Factor      = production_name | token [ "…" token ] | Group | Option | Repetition .
Group       = "(" Expression ")" .
Option      = "[" Expression "]" .
Repetition  = "{" Expression "}" .
Productions are expressions constructed from terms and the following operators, in increasing precedence:

|   alternation
()  grouping
[]  option (0 or 1 times)
{}  repetition (0 to n times)

Concrete Example

int_lit        = decimal_lit | binary_lit | octal_lit | hex_lit .
decimal_lit    = "0" | ( "1" … "9" ) [ [ "_" ] decimal_digits ] .
binary_lit     = "0" ( "b" | "B" ) [ "_" ] binary_digits .
octal_lit      = "0" [ "o" | "O" ] [ "_" ] octal_digits .
hex_lit        = "0" ( "x" | "X" ) [ "_" ] hex_digits .

decimal_digits = decimal_digit { [ "_" ] decimal_digit } .
binary_digits  = binary_digit { [ "_" ] binary_digit } .
octal_digits   = octal_digit { [ "_" ] octal_digit } .
hex_digits     = hex_digit { [ "_" ] hex_digit } .
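
As a reading aid for the notation above, the following Swift sketch hand-translates the `decimal_lit` and `decimal_digits` productions into a recognizer, with each EBNF construct noted in a comment. The function names are hypothetical, and the definition of `decimal_digit` as "0" … "9" is taken from the Go specification rather than from the excerpt quoted here.

```swift
// decimal_digit = "0" … "9" (from the Go specification, not quoted above).
func isDecimalDigit(_ c: Character) -> Bool {
    guard let ascii = c.asciiValue else { return false }
    return ascii >= UInt8(ascii: "0") && ascii <= UInt8(ascii: "9")
}

// Returns true if the whole string matches
//   decimal_lit = "0" | ( "1" … "9" ) [ [ "_" ] decimal_digits ] .
func isDecimalLiteral(_ text: String) -> Bool {
    let chars = Array(text)
    guard let first = chars.first, isDecimalDigit(first) else { return false }

    // Alternation, first branch: the single literal "0".
    if first == "0" { return chars.count == 1 }

    // Second branch: ( "1" … "9" ) followed by an option in square brackets.
    var i = 1
    if i == chars.count { return true }   // the option was omitted

    // Nested option: [ "_" ]
    if chars[i] == "_" { i += 1 }

    // decimal_digits = decimal_digit { [ "_" ] decimal_digit } .
    guard i < chars.count, isDecimalDigit(chars[i]) else { return false }
    i += 1
    // Repetition in braces: zero or more of [ "_" ] decimal_digit.
    while i < chars.count {
        if chars[i] == "_" { i += 1 }
        guard i < chars.count, isDecimalDigit(chars[i]) else { return false }
        i += 1
    }
    return true
}

print(isDecimalLiteral("1_000"))
print(isDecimalLiteral("1__0"))
```

Each bracket pair in the grammar becomes one `if`, and each brace pair becomes one loop. That is roughly the mental translation a reader of this notation has to perform.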

I like the idea of using asides with custom titles (implemented here) to group parts of the grammar.

Looking forward to your feature proposal here, a flexible system to define arbitrary anchors and link to them would be a nice addition for DocC in general.


Unicode has LATIN SUBSCRIPT SMALL LETTER O, P, and T:

"ₒₚₜ" == "\u{2092}\u{209A}\u{209C}"

For macOS, they look better on GitHub (using the system font) than on these forums (using Arial).

> *getter-setter-block* → **`{`** [*getter-clause*]() [*setter-clause*]()*ₒₚₜ* **`}`**

getter-setter-block → { getter-clause setter-clauseₒₚₜ }

Other platforms may not have suitable fonts.

The alternatives I considered were variations on OPT and (opt) written after the token. For example:

image

The suffix -OPT form looks ok after a syntactic category name, but the hyphen closed up against the end of the code voice looks awkward. Removing the hyphen makes it a bit ambiguous whether "OPT" is a separate token or applies to the token before it.

Using (opt) has similar challenges, although to me the (opt) form could be a reasonable alternative to the ? spelling.

Because the grammar in TSPL only marks single tokens as optional, not groups of tokens, I didn't previously consider spellings like the square brackets used by EBNF. (We did consider that in the Swift 1 timeframe, though. One of the syntaxes for grammars that I reviewed was the Open COBOL manual, which used brackets for optionality and vertical stacking for alternation.) If we used that spelling, we could make a style rule that we still don't use grouping. Here's what that might look like:

image

@ibex10 wrote:

I find the grammar of Go programming language easier to read.

Can you share a little more detail? What makes it easier to read? Why do you prefer it?

Making the typeset version of TSPL machine-readable isn't a goal, because the documentation is for people. However, if you wanted to read the grammar in a program, you could still read the markdown source.

@benrimmington wrote:

Unicode has LATIN SUBSCRIPT SMALL LETTER O, P, and T:

Thanks for pointing that out, but as you noted they don't really look good. Using them as pseudo-subscript also doesn't line up with the rationale for their inclusion in Unicode or recommended usage:

Super and subscripted letters and digits are quite common in some forms of phonetic or phonemic transcriptions, where the use of styles is both awkward and prone to data integrity issues when exported to plain text. (source)

In some cases, they look even worse than your screenshot:

image


I feel that the use of a simple but elegant notation to specify optional or repeated elements makes those elements easy to spot while reading the grammar.

[]  option (0 or 1 times)
{}  repetition (0 to n times)
[Foo] // Foo may be omitted
{Foo} // Foo may be omitted or repeated

This is maybe because I am quite ancient. I wrote my first program in Fortran (actually punched it onto cards) in 1975 on a Univac system. I prefer simplicity, and I really care about how much storage and energy we humans use to store or transport pieces of information around the world. :slight_smile:


I'd second that. I like the Go version better: for me it is easier to read and understand, and (not a small thing) much easier to write and represent in a plain-text editor. It's a variant of BNF/EBNF historically used in Pascal-based languages, while C traditionally used a different variant with the subscript "opt".

PS. I did not program with punch cards :slight_smile:


this may not be a welcome opinion, but i personally think it is more important to know if an audit has been done to ensure consistency with the real grammar implemented by swift-syntax, because i recall many of the production rules in TSPL are badly out-of-date.

i think this should take precedence over superficial concerns about font size and notation.


In my original post, I wrote:

Are these still the right goals for TSPL’s grammar? Does this proposed notation meet these goals?

Sorry, I think I phrased this a little too broadly. For this specific discussion, our priority is being able to publish TSPL using DocC. So maybe a better phrasing is this: Should we consider ease of writing the grammar as a goal? Do the changes/compromises I’m suggesting still meet the goals?

I don’t want to sidetrack too far into rewriting the entire grammar with a different notation. That would delay the adoption of DocC, and it isn’t really something we’d consider until after DocC-ification. But we can continue to discuss it a bit here, to decide whether it’s an idea we want to come back to in the future. Many of the original Swift 1 era reasons to use this simpler grammar are still true today. BNF-family notations for grammars allow more complex kinds of production rules, which means they take longer for a new reader to learn. I think the trade-off is worth it: the grammar becomes slightly more verbose for the most experienced readers, in service of making it possible for a larger audience of less-experienced folks to read the grammar. This goal is also why the grammar uses full English names like “superclass-expression” rather than abbreviated names like “expr-super”. For comparison, you can look at the “LangRef” grammar used during development of Swift 1.

The point @ibex10 raises about repetition is a valid one: After you learn the EBNF notation, it’s easy to spot these. The approach TSPL takes is one of convention — we use plural names for repetition without separators and names like foo-list when there are commas or similar. For reference, here’s that section in the TSPL style guide.
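
To illustrate the naming convention, here is a small Swift sketch that spells out the productions TSPL-style grammar uses where EBNF would write braces. The function names are hypothetical and the output is plain text rather than the typeset form. Note one difference from EBNF's {}: the plural production matches one or more elements, so the "zero" case is expressed by marking the plural name optional at the place where it is used.

```swift
// Repetition without separators: a plural name defined recursively.
// The trailing "?" marks the recursive reference as optional.
func repetition(of element: String, named plural: String) -> String {
    return "\(plural) → \(element) \(plural)?"
}

// Repetition with a separator: a name ending in "-list",
// written as two alternatives separated by "|".
func separatedList(of element: String, separator: String) -> String {
    let list = "\(element)-list"
    return "\(list) → \(element) | \(element) \(separator) \(list)"
}

let statements = repetition(of: "statement", named: "statements")
let tupleElements = separatedList(of: "tuple-element", separator: ",")
print(statements)     // statements → statement statements?
print(tupleElements)  // tuple-element-list → tuple-element | tuple-element , tuple-element-list
```

A reader who knows the convention can infer the shape of the production from the name alone, which is the role the braces play in EBNF.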

I don’t think writing the grammar in plain text is really a goal. This book has a full range of typographic styles available, and it should use them for readability. Whatever markdown we need to write to typeset the grammar will necessarily be well defined, meaning we can extract that part of the book and transform it into a machine-readable grammar if needed. There has been some separate discussion about using the same grammar in the new Swift parser and in TSPL, which dovetails neatly with that as well.
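
As a rough sketch of that extraction step, the Swift code below parses one production back out of a Markdown line. It assumes a hypothetical source format in which each production sits on one line, symbols are separated by spaces, literals are written as bold code voice, syntactic categories as italics, and optionality as a trailing `_?_`. The real TSPL source may be spelled differently, and all type and function names here are made up for illustration.

```swift
// Hypothetical machine-readable form of one right-hand-side symbol.
struct Symbol: Equatable {
    enum Kind { case literal, category }
    var kind: Kind
    var text: String
    var isOptional: Bool
}

// Returns `word` without the given prefix and suffix, or nil if it lacks either.
func unwrap(_ word: String, prefix: String, suffix: String) -> String? {
    guard word.count >= prefix.count + suffix.count,
          word.hasPrefix(prefix), word.hasSuffix(suffix) else { return nil }
    return String(word.dropFirst(prefix.count).dropLast(suffix.count))
}

func parseSymbol(_ word: String) -> Symbol? {
    var body = word
    var isOptional = false
    // Assumed spelling of the italic optionality marker.
    if word.hasSuffix("_?_") {
        body = String(word.dropLast(3))
        isOptional = true
    }
    // Check literals first, because bold code voice also starts with "*".
    if let text = unwrap(body, prefix: "**`", suffix: "`**"), !text.isEmpty {
        return Symbol(kind: .literal, text: text, isOptional: isOptional)
    }
    if let text = unwrap(body, prefix: "*", suffix: "*"), !text.isEmpty {
        return Symbol(kind: .category, text: text, isOptional: isOptional)
    }
    return nil
}

// Parses "*name* → symbol symbol ..." and returns nil for anything else.
func parseProduction(_ line: String) -> (name: String, symbols: [Symbol])? {
    let words = line.split(separator: " ").map(String.init)
    guard words.count >= 3, words[1] == "→",
          let name = unwrap(words[0], prefix: "*", suffix: "*") else { return nil }
    var symbols: [Symbol] = []
    for word in words.dropFirst(2) {
        guard let symbol = parseSymbol(word) else { return nil }
        symbols.append(symbol)
    }
    return (name: name, symbols: symbols)
}

if let production = parseProduction("*code-block* → **`{`** *statements*_?_ **`}`**") {
    print(production.name, production.symbols.map { $0.text })
}
```

Because the styling carries the meaning, the parser can tell a literal question mark from the optionality marker without any escaping rules. That only holds as long as the Markdown is written consistently.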

@taylorswift I probably missed that original post because it’s in the forum area for development of the Swift compiler. This is absolutely an issue we should address. If you have a GitHub account, can you create an issue? Otherwise I’ll open one to track these errors.

PS I also haven’t programmed with punch cards, but I enjoyed watching a training film for the IBM 029 keypunch, and I use ed(1) regularly.


I've created a pull request to review this change:
