Pitch: mathematical typesetting in DocC

TL;DR

I propose a new DocC directive, @Math:

/// The sample variance of the collection.
/// - Returns: The sample variance:
/// @Math("sample-variance.xml", description: "Sum, from i = 1 to n, of the squared norm of x_i minus mu. Everything divided by n minus 1.")
/// where 𝑛 is the collection's `count` and ‖𝑥ᵢ - 𝜇‖ is the Euclidean distance from each element 𝑥ᵢ to the sample mean 𝜇.
func sampleVariance …

Which would output something like this:

Pitch

Motivation

Documentation pages often need to include mathematical expressions. For example, Apple's documentation pages for Accelerate (which are made with DocC) are enriched by equations and matrices that help clarify the documentation's text.

Although adding mathematical expressions to DocC documentation is already possible, all current approaches are insufficient.

  • Unicode math is usually enough for expressions containing a single row, such as ‖𝑥ᵢ - 𝜇‖². But expressions containing multiple rows tend to be unreadable or to not look good in Unicode math. For instance, the best we can do for the expression in the TL;DR above is ¹⁄₍ₙ₋₁₎ ∑ᵢ₌₁ⁿ ‖𝑥ᵢ - 𝜇‖². And, many times, multi-row expressions are impossible to represent in Unicode math. For example, there are no superscript Greek letters in Unicode. Matrices are also impossible.
  • We can compile typeset math (.tex for LaTeX, or .xml for MathML) to an image (say, .png or .svg). This requires 2-3 files that must be kept in sync: the image in light mode, the image in dark mode, and optionally (but ideally) the source file. This is the approach that the Accelerate docs use (sans the dark mode support).
  • We can use this trick, which is what I personally prefer, but it comes with disadvantages of its own (detailed in the linked post).

Therefore, first-class support for mathematical typesetting would be a welcome addition to DocC. There are many different ways of going about this, so I'm making this post for us to discuss potential solutions. I’ll include my personal preference, then some alternative approaches.

Proposed solution

Keep it simple: write MathML, output MathML. DocC outputs a webpage; I think we should embrace this fact and stick to web standards when possible. MathML's syntax is more unwieldy than LaTeX's, but this solution:

  • Adds no dependencies, e.g. on MathJax or KaTeX.
  • Avoids the performance concerns of compiling LaTeX to SVG (or LaTeX to MathML, or MathML to SVG) at runtime. Concerning the first two cases (LaTeX to SVG and LaTeX to MathML), MathJax is infamous for the lag it can cause when there are many equations on-screen.
  • Adapts to light/dark mode with no extra work, since MathML elements use the current font color. Try it out: inspect a DocC webpage, add a <math> with some MathML, then toggle between light mode and dark mode.
  • Is the easiest to implement and maintain. After checking that the source MathML is valid, DocC would just have to paste it unmodified into the webpage.
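A minimal sketch of what a file like sample-variance.xml could contain, assuming the 1/(n − 1) prefactor form that the Unicode approximation above uses. The markup is illustrative and is not taken from a real implementation:

```xml
<!-- Hypothetical contents of sample-variance.xml (illustrative only). -->
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <!-- Prefactor: 1 / (n - 1) -->
  <mfrac>
    <mn>1</mn>
    <mrow><mi>n</mi><mo>−</mo><mn>1</mn></mrow>
  </mfrac>
  <!-- Sum from i = 1 to n -->
  <munderover>
    <mo>∑</mo>
    <mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
    <mi>n</mi>
  </munderover>
  <!-- Squared norm of x_i minus mu -->
  <msup>
    <mrow>
      <mo>‖</mo>
      <msub><mi>x</mi><mi>i</mi></msub>
      <mo>−</mo>
      <mi>μ</mi>
      <mo>‖</mo>
    </mrow>
    <mn>2</mn>
  </msup>
</math>
```

Under the proposal, DocC would check that a file like this is valid and then paste it into the generated page as-is.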

Also, any approach in which the output is MathML (regardless of whether the input is MathML or LaTeX) will have great accessibility, since users can navigate a <math> equation with a screen reader. This is better than just having an alt text, which is the best we can do until DocC supports mathematical typesetting. (That said, MathJax's support for screen readers is also very good.)

As a future direction, we can consider also supporting LaTeX as a source language. If we do, I believe we should compile the .tex to MathML (not to SVG, PNG, etc.) for consistency and for best accessibility. Also, we should compile the source LaTeX to MathML while compiling the documentation (i.e. not at runtime via MathJax or KaTeX) for lighter documentation webpages and to avoid performance concerns.
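For comparison, the TL;DR expression as LaTeX source might look like the following, again assuming the 1/(n − 1) prefactor form from the Unicode approximation above. How a @Math directive would accept such source is left open here:

```latex
\[
  \frac{1}{n-1} \sum_{i=1}^{n} \| x_i - \mu \|^2
\]
```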

Alternatives considered

In-source MathML instead of writing the MathML in a separate file

Swift Markdown (used by DocC) recognizes some HTML tags since it uses GitHub-Flavored Markdown. But the HTML tags in the GFM spec don't include any MathML tags, and even if they did, DocC ignores any HTML in the documentation comments. So we could change Swift Markdown to also recognize MathML, then change DocC to not ignore MathML tags as an exception to its "no HTML" rule.

This approach would be needlessly complicated and would require a Swift package that has nothing to do with DocC to break from its spec. It would also lead to bloated documentation comments:

/// The sample variance of the collection.
/// - Returns: The sample variance:
/// <math>
///    <mstyle displaystyle="true">
///        <mfrac>
///            <mn>1</mn>
///            <mrow>
///                <mi>n</mi>
///                <mo>-</mo>
///                <mn>1</mn>
///            </mrow>
///        </mfrac>
///        <mspace width="5px"/>
///        ...
/// </math>
/// where 𝑛 is the collection's `count` and ‖𝑥ᵢ - 𝜇‖ is the Euclidean distance from each element 𝑥ᵢ to the sample mean 𝜇.
func sampleVariance …

Make a DocC-specific language for in-source mathematical typesetting

Way too complicated, and would also lead to bloated documentation comments. We could alternatively write a Swift DSL similar to John Sundell's Plot that wraps MathML instead of HTML, which would make MathML less annoying to write, then typeset documentation math in a separate .swift file using the DSL. But DocC is intended to be language-agnostic, and this approach would still be more complex than necessary.

Write the math in MathML but compile it to SVG instead of using <math>

Mostly already covered. This could lead to insufficient accessibility, and we'd have to either:

  • Make DocC dependent on an existing MathML-to-SVG compiler.
  • Write our own.

This approach's motivation is the concern that MathML has insufficient cross-browser support. While true historically, MathML Core is now supported by all major browsers.

Use an extension other than .xml for the MathML files

We have three options: .html, .mathml, .xml.

| Extension | Advantage | Disadvantage |
| --- | --- | --- |
| .html | Documentation writers can preview their MathML while writing it: open the .html file on your browser, use your code editor's "preview HTML" functionality, or (if you're on macOS) use Quick Look. Also, .html has syntax highlighting in any code editor. | If a file only contains MathML, it wouldn't be quite correct for it to be .html because HTML files must begin with a doctype and be wrapped in <html>. And, if DocC were to expect a .html file for typeset math, that would incorrectly suggest to developers that the file would be allowed to contain non-MathML HTML. Also, code editors would autocomplete with disallowed tags. |
| .mathml | Most descriptive extension. Lets writers know exactly what is allowed in the file. | Not a real extension. Anyone who tries opening the file would be greeted with something like "There is no application set to open the document". No syntax highlighting. |
| .xml | Like .html, code editors would know how to open it and would provide syntax highlighting. | No preview capabilities, now or ever. |

Given the trade-offs above, I believe .xml is the best choice. The lack of file previews is not a big deal thanks to DocC's preview-documentation feature.

Please allow me to cite myself:

Also, you would want to write the formulas directly into the documentation file while editing, and keep them in the compiled files too, without any separate files. Imagine a documentation page with many formulas, where the formulas should render fast. Compare LaTeX plus the KaTeX library, which is really fast (faster, by the way, than the much bigger MathJax library). KaTeX is also so small that it can easily be embedded into every HTML output (if LaTeX formulas are used).

Note that LaTeX formulas (if they are not too complicated) are also quite readable when you just look at their code. There is no need to add a separate description: just write them (with appropriate separators) into the documentation source “and you are done”. And if you look at the produced HTML while the corresponding library, such as KaTeX, is missing, you still see something that might make sense to you.

Big +1

Just want to say that I like the idea of starting with a MathML-based solution as you propose. It has the key benefit of being a fairly compatible web standard, so DocC wouldn't need to introduce additional dependencies or build logic in order to parse and render things appropriately.

I agree that it's probably not ideal to try and mix the MathML and markdown syntax together directly, although that does have the drawback of not being able to read the math inline with the documentation text. Using a special directive seems like it might be a reasonable approach to me all things considered though.

Maybe this new directive could be flexible enough to one day also support alternative input languages like the LaTeX syntax that is easier to write, assuming it's popular enough to warrant the additional compiler logic/dependencies needed to make something like that work.

These aren't very strongly held opinions, but I just wanted to comment on your proposed approach. Thanks for writing it up!

I agree with putting MathML in an external file. It's too verbose and hard to read for it to be useful to put it inline in the DocC text. If we needed inline equations then LaTeX would be the way to go in terms of readability (and it might be a useful future extension) but for most purposes an external MathML file is fine.

People may of course want to generate those MathML files from LaTeX source but that can be handled by external programs.

I strongly agree with rendering the output in MathML regardless of whether MathML or LaTeX syntax is used for input. Support in browsers has been improving significantly and this choice seems like it would offer the best possible case for accessibility.

if a documentation compiler had the ability to transcode LaTeX to MathML, would there ever be any reason to have external documentation files in the first place?

my cursory search of LaTeX transcoders did not reveal any c or swift language implementations, but there do appear to exist some tools written in other languages that compile to executables, and a documentation compiler written in swift could distribute the binaries and invoke them through the OS.

You can use KaTeX or MathJax to output HTML:

  • Server side rendering: KaTeX produces the same output regardless of browser or environment, so you can pre-render expressions using Node.js and send them as plain HTML.

But Node.js is maybe a showstopper here.

But as I have written elsewhere, KaTeX is really small and so can actually be integrated in an HTML output.

For people who don't speak LaTeX, MathML is a good choice as there are GUI tools for generating it. (Even Microsoft Word's equation editor can be used in a pinch to generate MathML.) It's too unreadable to want to embed it, so I'd probably put it in an external file.

Obviously for those of us who have degrees in math and had to write papers in LaTeX we'd probably prefer to just write $\sum_{n=1}^{\infty} 2^{-n}=1$ or whatever rather than faffing about with MathML. But I'm not sure there's enough demand for that to justify needing anything more than the ability to transcribe MathML from an input file to the DocC output.
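To make the contrast concrete, here is a hand-written sketch of roughly what that one-liner becomes in presentation MathML. This is illustrative markup only, not the output of any converter mentioned in this thread:

```xml
<!-- Illustrative MathML for: sum from n = 1 to infinity of 2^(-n) = 1 -->
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <munderover>
    <mo>∑</mo>
    <mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow>
    <mi>∞</mi>
  </munderover>
  <msup>
    <mn>2</mn>
    <mrow><mo>−</mo><mi>n</mi></mrow>
  </msup>
  <mo>=</mo>
  <mn>1</mn>
</math>
```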

I think the demand exists. Nobody who can write LaTeX (and it is the standard for math formulas in the academic world) would ever want to use MathML, so you lose the audience who would actually be interested in adding math formulas in the first place.

I think LaTeX is clearly a “must have” and MathML a “nice to have”.

Update: As a side note, compare one of the more famous Visual Studio Code extensions for editing markdown which includes the ability to display math formulas.

You should then not write more than a few formulas. But give people a tool and they might use it extensively.

I agree that it's probably not ideal to try and mix the MathML and markdown syntax together directly, although that does have the drawback of not being able to read the math inline with the documentation text.

Solving this problem is one of the reasons the proposed directive has a description parameter; in the example in the post’s TLDR, the description describes the mathematical expression for the in-source documentation readers. The other reason is accessibility: the description would be used as an alt text for the <math>.
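As a sketch of the accessibility half, assume DocC would map description onto MathML's alttext attribute. Nothing in this thread says which mechanism it would use, and aria-label would be another candidate. The emitted markup for a short expression like ‖𝑥ᵢ − 𝜇‖² might then look like:

```xml
<!-- Hypothetical output: alttext carries the directive's description. -->
<math xmlns="http://www.w3.org/1998/Math/MathML"
      alttext="Squared norm of x_i minus mu.">
  <msup>
    <mrow>
      <mo>‖</mo>
      <msub><mi>x</mi><mi>i</mi></msub>
      <mo>−</mo>
      <mi>μ</mi>
      <mo>‖</mo>
    </mrow>
    <mn>2</mn>
  </msup>
</math>
```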

Maybe this new directive could be flexible enough to one day also support […] LaTeX.

Yup, that’s the idea. Its syntax is more friendly, and like @sspringer mentioned, it’s the standard for mathematical typesetting. (Though, for the reasons outlined in the post, I agree with you that this is best left as a future direction instead of the initial solution.)

[LaTeX] is the standard for math formulas in the academic world.

This is a good point. Ideally, both would be supported. It’s just that with MathML it’s trivial to get good accessibility, no extra dependencies, dark mode support, no runtime compiling, and no “equation source appears for a split second, then the real equation appears” effect (common on MathOverflow). All of these are also possible with LaTeX, just harder.

This is also something that I (at least for now) strongly believe in. If there’s a web standard for mathematical typesetting, and Chromium/WebKit/Gecko all support it, then embracing it would be the forward-thinking solution, even if this support still has room to grow.

You see this with MathJax, but not with KaTeX. In my processes I output HTML pages with a lot of LaTeX formulas (a few hundred on a page) plus KaTeX, and I do not (!) see any delay when viewing them; I only see the rendered math formulas, without any flickering.

Also, as I have written elsewhere, for good MathML rendering, when the MathML from the source is passed through to the output unconverted, you have to use e.g. MathJax to render it, and so you end up with exactly the effect that you describe.

Let me put it this way: if someone wants to add a math formula to some documentation, this person is likely to have a corresponding academic background (and maybe has learnt how to think about the complexity of some algorithm), and would of course like to write LaTeX. If only MathML is possible, this person would rather not add any formula.

@sspringer Re: KaTeX: that’s good to hear. But instead of using MathJax or KaTeX for LaTeX support, why don’t we compile the LaTeX source during the documentation compilation stage? As fast as KaTeX may be, we could avoid compiling the LaTeX at runtime entirely; and as slim as KaTeX may be, we could avoid including it in the webpage entirely (for slightly faster loads and one less dependency).

Yes, it might be even better to “pre-render” them (i.e. outputting HTML+CSS or SVG, maybe using KaTeX or MathJax). But outputting MathML is not the solution. MathML would also have to be pre-rendered, as browser support for MathML is still not good. And for efficiency you would have to keep these outputs inside the main file, not (!) use an external file for each formula.

So if you would like to keep the source of your HTML file succinct and small, it would be better to use LaTeX in the HTML + KaTeX. If it does not bother you that your HTML source gets ugly (who cares anyway) and big (but on the other hand needs no additional JavaScript), pre-rendering might be a very good idea (you might still need some CSS).

Update: Both MathJax and KaTeX use their own fonts (which is quite important for good math rendering). So if formulas are to be precompiled, the font issue has to be evaluated, as this would be a dependency that one would like to avoid. Maybe some real examples / implementations would be a good idea at this point.

Update 2:

For MathJax (and the same for any other JavaScript based math rendering library) concerning fonts:

Since browsers do not provide APIs to access font metrics, MathJax has to ship with the necessary font data; this font data is generated during development and cannot be generated on the fly. In addition, most fonts do not cover the relevant characters for mathematical layout. Finally, some fonts (e.g. Cambria Math) store important glyphs outside the Unicode range, making them inaccessible to JavaScript. These are the main reasons why MathJax is unable to support arbitrary fonts at this time.

Even if no JavaScript is used, you need to control the font used, which is a problem when using HTML without delivering the corresponding fonts. The standard web fonts are not sufficient for math. So you have to include the fonts anyway.

Just to make sure I’m following you on this point:

  • By “output”, you mean the SVG or HTML+CSS or MathML (whichever we end up going with) generated by compiling/prerendering the LaTeX or MathML, correct?
  • By “keep these outputs inside the main file”, you mean including the output inside the HTML source generated by DocC (as opposed to using e.g. <img src="equation.svg"> if the output is SVG), correct?

If so, then yes I agree. But why can’t the output be MathML instead of SVG? Again, I could be wrong, but (according to MDN and caniuse) all major browsers support MathML Core.

Yes, if you accept sub-par rendering results (test page, and another test page) or hope that this might improve over time. A mathematician has a hard time looking at those bad renderings; they remind me of cheaply typeset books from before there was LaTeX. It is so bad that for many years, and still today, I have not been able to use native browser support for MathML for my clients (even with MathML support now enabled in Chrome). There has not been much improvement for a long time now.

(Also note that to my knowledge only MathML presentation markup is supported by browsers and not content markup, but this should be OK.)

Remember what was posted on the Chromium bug site in 2013: "MathML is not something that we want at this time. We believe the needs of MathML can be sufficiently met by libraries like MathJax and doesn't need to be more directly supported by the platform."

There are good arguments for native MathML support in browsers, but I have yet to see more commitment here. The Chromium support was obviously only a matter of activating a switch (the MathML support stemming from WebKit), not a new implementation. (Does anybody know of any statements on this?)

Of course, in the case of DocC its community might choose to use the native MathML support by the browsers (or web components) despite its bad implementation. Personally I think integrating KaTeX is so easy... But yes, compromises have to be made sometimes.

Update:

compromises have to be made sometimes

The more I think about it, the more I would say: No compromise, I want my DocC output to be beautiful (that’s maybe Apple’s fault with all their emphasis on aesthetics :wink:). So no native MathML rendering, please. Sorry.

Yes.

Just one more note about integrating a rendering library vs. “pre-compilation” (e.g. to HTML+CSS). The following may not have been clear, because I first missed an important point (the fonts):

If not using native MathML rendering (because native MathML rendering is quite bad, see above), even when you “pre-compile” the formulas, you will need to embed some fonts to display them. So you then always end up with a “dependency”. And considering that there is always a dependency, it seems sensible to at least leave the LaTeX formulas in their original form in the HTML and embed KaTeX to render LaTeX.

Because of the small size (even including the fonts) of KaTeX and because its rendering is really fast, this is a very good solution for LaTeX.

For MathML you would have to use MathJax, which is much bigger (even if you leave out the parts of MathJax that you do not need) and slower. So what is needed is not a conversion from LaTeX to MathML, as was suggested in some comments; conversely, a conversion from MathML to LaTeX would be a much better idea. For that conversion we have a Swift package in development; if my partners agree on open-sourcing it, this could be a good solution.