DocC colons in filenames

Exactly. I've posted a lot, but I'll try to explain it one more time because there appears to be some confusion and I don't think I've done a great job of it so far.

There are 2 transformations:

Swift symbol -----> File name -----> URL path
               ①               ②

When the static website is generated, we go left to right, doing transformation (1) to get the file names, then transformation (2) to get the URL path. When requests are processed, only transformation (2) needs to be reversed; the server only cares about finding the static file; it's not interested in which particular Swift symbol that corresponds to.

Transformation (2) is what percent-encoding is for: when we have content that needs to be escaped to fit in a URL. But what should we use for (1)? Well, as has been suggested a few times, we could use percent-encoding for that as well (i.e. storing percent-encoded names on disk). But here are the problems with that:

  • We'd have to double-encode.

    If we only single-encoded these names, they would be extremely fragile. It is important that we never encode additional characters, and that we never decode any part of the name, otherwise the result wouldn't match the name of the file on the filesystem, and we wouldn't be able to find it.

    As it turns out, there are all kinds of ways that percent-encoding gets implicitly added and removed (the characters to encode even depend on which part of the URL you're in), to the extent that we really can't guarantee stability with only a single level of encoding. We would need to perform 2 rounds of percent-encoding - encoding once, then encoding that encoded content again.

  • Double-encoding is very inefficient.

    Percent-encoding is highly inefficient. It takes 3 bytes to store a single escaped byte, and 5 bytes to store a single byte if you double-encode. That's a 5x bloat factor.

    0xAB (1 byte) -> "%AB" (3 bytes) -> "%25AB" (5 bytes)

    This gets a lot worse when you consider Unicode names with multi-byte UTF-8 sequences. Many CJK scalars need 3 bytes of UTF-8 to encode, which blows up to 15 bytes(!) of double-percent-encoding.

  • Double-encoding is easy to get wrong.

    Consider developers, perhaps working on DocC/DocC-Render, or developing any of their own tools using them (given all the documentation tools we have/had - from Jazzy to swift-doc to swift-biome - it's certainly plausible that somebody would want to build custom tools atop DocC).

    If you see a string like "%AB", it's hard to tell which stage you're at. If I decode this string, am I reversing (2) and turning some URL content into a file name? Or would I be reversing (1) and getting a symbol name?

    Strict API discipline and vigilance can make this workable, which means it isn't really workable. Indeed, developers have tripped over this constantly for as long as percent-encoding has existed, often because they "defensively decode" strings (i.e. decode too many times), which has caused all kinds of bugs and even security vulnerabilities (not that I think security is a concern in this case; it just illustrates that developers have a hard time tracking how many times each string has been decoded).

    None of this should be a massive surprise - of course using the same escaping format for (1) and (2) is going to be confusing! API discipline aside, it is desirable for the inputs to (1) and (2) to be easily distinguishable at any point in the process.
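The double-encoding bloat and the "which stage am I at?" ambiguity described above are easy to reproduce. Here's a sketch using Python's `urllib.parse` purely for illustration (the actual toolchain is Swift; the behaviour of percent-encoding is the same everywhere):

```python
from urllib.parse import quote, unquote

# Transformation (1): symbol name -> file name, using percent-encoding.
file_name = quote("perform(loghandle:)", safe="")
# Transformation (2): file name -> URL path. The '%' signs produced by
# round one must themselves be escaped, so every escaped byte grows again.
url_path = quote(file_name, safe="")

print(file_name)  # perform%28loghandle%3A%29
print(url_path)   # perform%2528loghandle%253A%2529

# One raw byte costs 3 bytes after one round and 5 after two: 5x bloat.
assert quote(b"\xab") == "%AB"
assert quote("%AB", safe="") == "%25AB"

# The ambiguity: given "%AB" in isolation, nothing tells you whether one
# more unquote() reverses (2) (yielding a file name) or (1) (a symbol name).
assert unquote("%25AB") == "%AB"
```

The last assertion is the whole problem in one line: decoding is not idempotent, yet nothing in the string records how many rounds it has been through.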

--

That's why I suggest that we ignore percent-encoding. The fact that these names end up in a URL is a red herring; percent-encoding is transformation (2), and there is absolutely nothing at all which requires us to use percent-encoding for (1).

If we used an alternative, such as base64 or bootstring, for (1), we could actually reduce (2) to a no-op. There would be no double-encoding, it's more efficient, and it's a lot harder to get wrong because these names look nothing like percent-encoding. And in fact, because the output is just LDH (letters, digits, and hyphens), you won't corrupt these names even if you do accidentally percent-decode them too many times.

Illustration:

// Percent-encoding:

// Symbol name               File name                          URL path
perform(loghandle:)   --->   perform%28loghandle%3A%29   --->   perform%2528loghandle%253A%2529

// Bootstring:

// Symbol name               File name                          URL path (unchanged)
perform(loghandle:)   --->   -performloghandle-wta1a63a   --->   -performloghandle-wta1a63a

// Percent-encoding:

// Symbol name          File name                  URL path
天空             --->   %E5%A4%A9%E7%A9%BA   --->   %25E5%25A4%25A9%25E7%25A9%25BA

// Bootstring:

// Symbol name          File name                  URL path (unchanged)
天空             --->   -fws488e             --->   -fws488e

(Note: in these examples, I'm dropping the "xn--" prefix in favour of just a leading "-". Which I think we could do.)

EDIT: I threw together a working demo to get a better look at it. Everything just works AFAICT.

  • Simple names such as hostname remain unchanged:

    (https://karwa.github.io/swift-url-docs-test/main/documentation/weburl/weburl/hostname)

  • All other names are encoded with bootstring:

    (https://karwa.github.io/swift-url-docs-test/main/documentation/weburl/weburl/-serializedexcludingfragment-k6a0c21b)

  • The files are all LDH. You should be able to host those from just about anywhere, and we can remove an ad-hoc workaround for leading periods (which today are prefixed with an extra ' so the OS doesn't consider them to be hidden files).

Here's a patch for DocC which will generate these file names, if you want to test it yourself: GitHub


Indeed, which is what I mentioned in my post. But as I’ve also said, the risk of collision is significantly lower (to the point, I think, where it may be considered quasi null) when using characters in this range (and a 1-to-1 mapping), than when doing an N-to-1 mapping to an otherwise-commonly-used character such as _.

I’ll try to make my point clearer, because I’ve obviously failed to do so in my previous post: what I meant was that if we accept the compromise of having a collision risk by going for a character substitution approach, then we should go for a set of substitution rules that minimizes the collision risk while maximizing the readability of the URLs.
It would still be a compromise of course. But I proposed that because @ethankusters mentioned going in a similar direction, but with URL readability in mind.

[just nitpicking] It’s not necessarily true. It is probably possible to find some Unicode characters which are not valid for use in Swift code, but valid for filenames (if anything, by using characters from the private use areas), and if there are enough of them to cover the (rather small) set of reserved filename characters then a collision-risk-free mapping would be possible.
But then of course, it would mean doing away with the notion of preserving URL readability, and then we may as well go for the approach proposed by @Karl instead.

I’m really curious to know whether anybody actually uses characters in the U+FF01–FF5E range in Swift code. If nobody does, I think it would be worth modifying the grammar to exclude this range from the set of valid characters, in order to reserve it for character substitution (yeah, I’m just dreaming here 🙂).

Just out of curiosity, may I ask why?

swift-biome is completely encoding-agnostic. it does not issue redirects due to differences in percent-encoding alone. it can do this because %[A-Fa-f0-9][A-Fa-f0-9] is not a valid swift operator.
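The observation above relies on a nice property: `%` is a valid Swift operator character, but letters and digits are not, so `%` followed by two hex digits can never appear inside a valid operator name. That makes percent-encoding unambiguously detectable. A sketch of such a check (illustrative only, not swift-biome's actual code):

```python
import re

# '%' alone could be the remainder operator, but an operator can never
# contain letters or digits, so '%' + two hex digits proves the string
# has been percent-encoded at least once.
PERCENT_ESCAPE = re.compile(r"%[A-Fa-f0-9]{2}")

assert PERCENT_ESCAPE.search("perform%28loghandle%3A%29")   # encoded
assert not PERCENT_ESCAPE.search("perform(loghandle:)")     # raw symbol
assert not PERCENT_ESCAPE.search("*(_:_:)")                 # raw operator path
```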

Hi all! Is there any progress on this?

I think there are still a few unanswered questions that make the long-term solution not yet actionable, and no one is driving that discussion.

Personally I would be in favor of an incremental solution that adds < > : " / \ | ? * to the character set that DocC replaces with _ as long as it's an opt-in configuration.

I find it very nice and readable that the Swift multiplication operator can have the path /documentation/swift/numeric/*(_:_:) and would consider it a regression if this changed for hosting environments that can support these characters.
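The incremental substitution being proposed amounts to a one-line character map. A sketch (the `sanitize` helper here is hypothetical, not DocC API; it also shows the readability cost the substitution carries):

```python
# Characters reserved on common filesystems (notably Windows), each
# mapped to '_' in the same way DocC already substitutes some characters.
RESERVED = '<>:"/\\|?*'

def sanitize(component: str) -> str:
    """Hypothetical opt-in filter: replace reserved characters with '_'."""
    return "".join("_" if c in RESERVED else c for c in component)

assert sanitize("perform(loghandle:)") == "perform(loghandle_)"
# The multiplication operator's path loses most of its readability:
assert sanitize("*(_:_:)") == "_(____)"
```

The second assertion is exactly the regression concern: `*(_:_:)` and the comparison operators all collapse toward the same underscore soup.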


The problem is that this also means it is implicitly broken. Having a secondary path means it is going to be an afterthought and will frequently become an issue. The fewer configurations there are, the simpler the model and the less likely that changes will regress the functionality.

Punycode-encoding the URL might allow that, since URL bars could decode the Punycode.


Whilst URL readability is nice, I don't believe it should outweigh all other considerations.

Consider that DocC already replaces many characters with _, and some components may include suffixes with hashes, neither of which are particularly readable. For example, the < and > operators provided by the Comparable protocol appear as _(_:_:)-74jbv and _(_:_:)-3s5ym right now. Can you tell which one is which?

If you look at the documentation for Int's operators, & is allowed (&(_:_:)-8h2q8), but | and ^ are substituted (_(_:_:)-26x3w and _(_:_:)-591r5). Of course it also has comparison operators, so there are actually quite a lot of _(_:_:)-blah paths (as well as minor variants, such as _=(_:_:)-blah and __(_:_:)-blah).

So whether you get a path component like *(_:_:) or _(_:_:)-74jbv seems to be a kind of accident that happens sometimes, and is not at all reliable. Readability is nice, but it is already so limited that I don't think we should cling on too desperately to the couple of examples where a handful of special characters allow for a marginal readability win at the expense of portability.

When it comes to writing URLs, if I'm entering a documentation URL in the browser, I'm generally either looking for something on developer.apple.com or one of my own projects (relying on the browser's autocomplete to fill in the first half of the URL). I know the path starts with /documentation, then a module name (say, /uikit), then a type (perhaps /uicollectionview). That works, and it's a nice feature, and it should continue to work; but if I'm looking for something more specific, or something involving special characters, I think it's better to use search instead. Nobody wants to write out paths like *(_:_:), and if it includes hashes or substitutions, you can't even know what the path component is.

They wouldn't. The contents of path components are opaque; only Punycode in the hostname should be decoded. Besides, we would need to customise some parameters of the bootstring encoding (it does not generally encode ASCII code-points at all), which means browsers wouldn't be able to decode it.

The benefit of a Bootstring encoding (of which Punycode is one example) is that it

allows a string of basic code points to uniquely represent any string of code points drawn from a larger set.

  • Every unique input produces a unique result

  • It is reversible (provided you know the parameters it was encoded with)

  • The "basic code points" can be defined arbitrarily - we can allow ASCII alphas and digits, maybe * if that's important, etc. It's entirely up to us to decide which characters are allowed.

  • Basic code points in the original string are left unchanged, so for the common case where we have ASCII identifiers, we can maintain high readability and avoid percent-encoding.

  • It is very compact for real-world Unicode text.

    The way Bootstring encodes its data is with a series of variable-length integers, containing instructions such as: insert 'e' at position 0, then insert 'h' at position 0, then insert 'l' at position 2 and 3, .... That's great if all of the characters are close together, as they tend to be for Unicode text in a single script (it's quite unusual that you'd mix Chinese and Arabic scalars, for instance). You have a large initial delta to set the starting code-point, then a bunch of much smaller deltas.

    Contrast that to something like UTF-8, where every code-point must encode its full 21-bit Unicode scalar value, regardless of which characters came before it. And then you have to percent-encode that stuff, which triples its length. I included some examples in my previous post and the difference can be dramatic.
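The compactness claim is easy to observe with Python's built-in `punycode` codec (one concrete Bootstring; just a sketch for illustration):

```python
from urllib.parse import quote

# Single-script text clusters its code points together, so after the
# large initial delta, the remaining bootstring deltas are small.
text = "こんにちは"  # five hiragana scalars, each 3 bytes of UTF-8

boot = text.encode("punycode").decode("ascii")
pct = quote(text, safe="")

# Percent-encoded UTF-8 pays 9 bytes per 3-byte scalar, every time.
assert len(pct) == 9 * len(text)
# The bootstring form is much shorter, and reverses losslessly.
assert len(boot) < len(pct)
assert boot.encode("ascii").decode("punycode") == text
```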

The downside? I can't find examples of anybody using Bootstring for anything other than IDNA/Punycode. I don't know why not - maybe nobody who had this kind of use-case was aware of it. The RFC explicitly mentions that it is designed to scale to other uses, though.

The demo I posted earlier doesn't work any more (I use that repo for various experiments), but the patch is still up and should still work.


to add another data point: although swift-biome long evolved past this (it now percent-encodes everything), Biome v0.1 (back then it was called Entrapta) just spelled out the operator characters in english.

https://kelvin13.github.io/godot-swift/infix operator tilde-tilde-/

https://kelvin13.github.io/godot-swift/VectorRangeExpression/tilde-=(pattern:element:)/

the spaces in the symbol name prevented it from colliding with named methods, which was needed because back then Entrapta deployed to github pages, and did not have the concept of path orientation that Biome uses today.
