Whilst URL readability is nice, I don't believe it should outweigh all other considerations.
Consider that DocC already replaces many characters with `_`, and some components may include suffixes with hashes, neither of which are particularly readable. For example, the `<` and `>` operators provided by the `Comparable` protocol appear as `_(_:_:)-74jbv` and `_(_:_:)-3s5ym` right now. Can you tell which one is which?
If you look at the documentation for `Int`'s operators, `&` is allowed (`&(_:_:)-8h2q8`), but `|` and `^` are substituted (`_(_:_:)-26x3w` and `_(_:_:)-591r5`). Of course, it also has comparison operators, so there are actually quite a lot of `_(_:_:)-blah` paths (as well as minor variants, such as `_=(_:_:)-blah` and `__(_:_:)-blah`).
So whether you get a path component like `*(_:_:)` or `_(_:_:)-74jbv` seems to be a kind of accident that happens sometimes, and is not at all reliable. Readability is nice, but it is already so limited that I don't think we should cling too desperately to the couple of examples where a handful of special characters allow a marginal readability win at the expense of portability.
When it comes to writing URLs: if I'm entering a documentation URL in the browser, I'm generally either looking for something on developer.apple.com or one of my own projects (relying on the browser's autocomplete to fill in the first half of the URL). I know the path starts with `/documentation`, then a module name (say, `/uikit`), then a type (perhaps `/uicollectionview`). That works, it's a nice feature, and it should continue to work; but if I'm looking for something more specific, or something involving special characters, I think it's better to use search instead. Nobody wants to write out paths like `*(_:_:)`, and if the component includes hashes or substitutions, you can't even know what the path component is.
They wouldn't. The contents of path components are opaque; only Punycode in the hostname should be decoded. Besides, we would need to customise some parameters of the Bootstring encoding (it does not generally encode ASCII code-points at all), which means browsers wouldn't be able to decode it.
The benefit of a Bootstring encoding (of which Punycode is one example) is that it allows a string of basic code points to uniquely represent any string of code points drawn from a larger set:

- Every unique input produces a unique result.
- It is reversible (provided you know the parameters it was encoded with).
- The "basic code points" can be defined arbitrarily - we can allow ASCII alphas and digits, maybe `*` if that's important, etc. It's entirely up to us to decide which characters are allowed.
- Basic code points in the original string are left unchanged, so for the common case where we have ASCII identifiers, we can maintain high readability and avoid percent-encoding.
- It is very compact for real-world Unicode text.
The way Bootstring encodes its data is with a series of variable-length integers, containing instructions such as: *insert 'e' at position 0, then insert 'h' at position 0, then insert 'l' at positions 2 and 3, ...*. That's great if all of the characters are close together, as they tend to be for Unicode text in a single script (it's quite unusual that you'd mix Chinese and Arabic scalars, for instance). You have a large initial delta to set the starting code-point, then a bunch of much smaller deltas.

Contrast that to something like UTF-8, where every code-point must encode its full 21-bit Unicode scalar value, regardless of which characters came before it. And then you have to percent-encode that stuff, which triples its length. I included some examples in my previous post and the difference can be dramatic.
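To make the delta mechanism concrete, here's a minimal sketch of a Bootstring encoder using the standard Punycode parameters from RFC 3492 (base 36, tmin 1, tmax 26, skew 38, damp 700, initial bias 72, initial n 128). It's purely an illustration of the mechanism - the customised variant discussed above would pick a different set of basic code points and digit alphabet - and it omits the overflow checks a real encoder needs:

```swift
// Standard Punycode parameters from RFC 3492 (a customised path encoding would change these).
let base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700
let initialBias = 72, initialN = 128

// Maps a digit value (0...35) to its code point: 'a'...'z', then '0'...'9'.
func encodeDigit(_ d: Int) -> Character {
    Character(UnicodeScalar(UInt8(d < 26 ? 97 + d : 22 + d)))
}

// Bias adaptation (RFC 3492, section 6.1): keeps digit thresholds tuned to recent delta sizes.
func adapt(_ rawDelta: Int, numPoints: Int, firstTime: Bool) -> Int {
    var delta = firstTime ? rawDelta / damp : rawDelta / 2
    delta += delta / numPoints
    var k = 0
    while delta > ((base - tmin) * tmax) / 2 {
        delta /= base - tmin
        k += base
    }
    return k + ((base - tmin + 1) * delta) / (delta + skew)
}

func punycodeEncode(_ input: String) -> String {
    let scalars = input.unicodeScalars.map { Int($0.value) }

    // Basic code points pass through unchanged, followed by a delimiter.
    var output = String(input.unicodeScalars.filter { $0.value < 128 }.map { Character($0) })
    let basicCount = output.count
    if basicCount > 0 { output.append("-") }

    var n = initialN, delta = 0, bias = initialBias
    var handled = basicCount

    while handled < scalars.count {
        // Jump to the smallest un-handled code point; this is the large initial delta.
        let m = scalars.filter { $0 >= n }.min()!
        delta += (m - n) * (handled + 1)
        n = m

        for c in scalars {
            if c < n { delta += 1 }
            if c == n {
                // Emit delta as a variable-length integer: the "insert at position" instruction.
                var q = delta
                var k = base
                while true {
                    let t = max(tmin, min(tmax, k - bias))
                    if q < t { break }
                    output.append(encodeDigit(t + (q - t) % (base - t)))
                    q = (q - t) / (base - t)
                    k += base
                }
                output.append(encodeDigit(q))
                bias = adapt(delta, numPoints: handled + 1, firstTime: handled == basicCount)
                delta = 0
                handled += 1
            }
        }
        delta += 1
        n += 1
    }
    return output
}

print(punycodeEncode("bücher"))  // "bcher-kva": the ASCII survives, the 'ü' becomes a short delta
```

The mostly-ASCII case keeps its readability and just grows a short tail, whereas the percent-encoded UTF-8 form spends nine characters on every three-byte scalar. The delta machinery above is exactly what a customised parameter set would reuse; only the definition of the basic code points and the digit alphabet would change.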
The downside? I can't find examples of anybody using Bootstring for anything other than IDNA/Punycode. I don't know why not - maybe nobody who had this kind of use-case was aware of it. The RFC explicitly mentions that it is designed to scale to other uses, though.
The demo I posted earlier doesn't work any more (I use that repo for various experiments), but the patch is still up and should still work.