Exactly. I've posted a lot, but I'll try to explain it one more time because there appears to be some confusion and I don't think I've done a great job of it so far.
There are 2 transformations:
Swift symbol --(1)--> File name --(2)--> URL path
When the static website is generated, we go left to right, doing transformation (1) to get the file names, then transformation (2) to get the URL path. When requests are processed, only transformation (2) needs to be reversed; the server only cares about finding the static file; it's not interested in which particular Swift symbol that corresponds to.
Transformation (2) is what percent-encoding is for: when we have content that needs to be escaped to fit in a URL. But what should we use for (1)? Well, as has been suggested a few times, we could use percent-encoding for that as well (i.e. storing percent-encoded names on disk). But here are the problems with that:
- We'd have to double-encode.
If we only single-encoded these names, they would be extremely fragile. It is important that we never encode additional characters, and that we never decode any part of the name, otherwise the result wouldn't match the name of the file on the filesystem, and we wouldn't be able to find it.
As it turns out, there are all kinds of ways that percent-encoding gets implicitly added and removed (the characters to encode even depend on which part of the URL you're in), to the extent that we really can't guarantee stability with only a single level of encoding. We would need to perform 2 rounds of percent-encoding - encoding once, then encoding that encoded content again.
- Double-encoding is very inefficient.
Percent-encoding is highly inefficient. It takes 3 bytes to store a single escaped byte, and 5 bytes to store a single byte if you double-encode. That's a 5x bloat factor.
0xAB (1 byte) -> "%AB" (3 bytes) -> "%25AB" (5 bytes)
This gets a lot worse when you consider Unicode names with multi-byte UTF-8 sequences. Many CJK scalars need 3 bytes of UTF-8 to encode, which blows up to 15 bytes(!) of double-percent-encoding.
- Double-encoding is easy to get wrong.
Consider developers, perhaps working on DocC/DocC-Render, or developing any of their own tools using them (given all the documentation tools we have/had - from Jazzy to swift-doc to swift-biome - it's certainly plausible that somebody would want to build custom tools atop DocC).
If you see a string like "%AB", it's hard to tell which stage you're at. If I decode this string, am I reversing (2) and turning some URL content into a file name? Or would I be reversing (1) and getting a symbol name?
Strict API discipline and vigilance can make this workable, which means it isn't really workable. Indeed, developers have tripped over it constantly since percent-encoding was created - often because they "defensively decode" strings, which means decoding too many times. That has been the cause of all kinds of bugs and even security vulnerabilities (not that I think security is a concern in this case; it just illustrates that developers have a hard time tracking how many times each string has been decoded).
None of this should be a massive surprise - of course using the same escaping format for (1) and (2) is going to be confusing! API discipline aside, it is desirable for the inputs to (1) and (2) to be easily distinguishable at any point in the process.
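To make the bloat and the "which stage am I at?" problem concrete, here is a quick sketch. It uses Python's urllib.parse purely for illustration (it is not what DocC does), and the helper names encode_once and encode_twice are made up for this example.

```python
from urllib.parse import quote, unquote


def encode_once(name: str) -> str:
    # safe="" escapes every reserved character, including "/".
    return quote(name, safe="")


def encode_twice(name: str) -> str:
    # Transformation (1) followed by transformation (2), both as percent-encoding.
    return encode_once(encode_once(name))


# Size of each stage, in bytes (the encoded forms are pure ASCII).
for symbol in ["perform(loghandle:)", "天空"]:
    raw_bytes = len(symbol.encode("utf-8"))
    print(symbol, raw_bytes, len(encode_once(symbol)), len(encode_twice(symbol)))

# "Defensive decoding": one decode too many and the name no longer
# matches the file on disk.
file_name = encode_once("perform(loghandle:)")   # what would be stored on disk
url_path = encode_once(file_name)                # what would appear in the URL
decoded_once = unquote(url_path)                 # reverses (2): the file name
decoded_twice = unquote(decoded_once)            # also reverses (1): the symbol
print(decoded_once == file_name, decoded_twice == file_name)
```

Nothing in the string itself says whether it has been decoded once or twice; the caller has to keep track of that.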
--
That's why I suggest that we ignore percent-encoding. The fact that these names end up in a URL is a red herring; percent-encoding is transformation (2), and there is absolutely nothing at all which requires us to use percent-encoding for (1).
If we used an alternative, such as base64 or bootstring for (1), we could actually reduce (2) to being a no-op. There would be no double-encoding, it's more efficient, and it's a lot harder to get wrong because these names look nothing like percent-encoding. And in fact, because the bootstring output is just LDH (letters, digits, and hyphens), you won't corrupt these names even if you do accidentally percent-decode them too many times.
Illustration:
// Percent-encoding:
// Symbol name File name URL path
perform(loghandle:) ---> perform%28loghandle%3A%29 ---> perform%2528loghandle%253A%2529
// Bootstring:
// Symbol name File name URL path (unchanged)
perform(loghandle:) ---> -performloghandle-wta1a63a ---> -performloghandle-wta1a63a
// Percent-encoding:
// Symbol name File name URL path
天空 ---> %E5%A4%A9%E7%A9%BA ---> %25E5%25A4%25A9%25E7%25A9%25BA
// Bootstring:
// Symbol name File name URL path (unchanged)
天空 ---> -fws488e ---> -fws488e
(Note: in these examples, I'm dropping the "xn--" prefix in favour of just a leading "-". Which I think we could do.)
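As a rough check of the "can't corrupt it" claim, the sketch below runs the names from the illustration through a few extra rounds of percent-encoding and decoding. It is Python again, for illustration only; is_stable_in_urls is a hypothetical helper, and the bootstring names are copied from the examples above rather than computed here.

```python
from urllib.parse import quote, unquote


def is_stable_in_urls(name: str, rounds: int = 3) -> bool:
    """True if repeated percent-encoding or decoding never changes `name`."""
    encoded, decoded = name, name
    for _ in range(rounds):
        encoded = quote(encoded, safe="")
        decoded = unquote(decoded)
        if encoded != name or decoded != name:
            return False
    return True


names = [
    "hostname",                    # simple name, left as-is
    "-performloghandle-wta1a63a",  # bootstring name from the illustration
    "-fws488e",                    # bootstring name from the illustration
    "perform%28loghandle%3A%29",   # single percent-encoded file name
]
for name in names:
    print(name, is_stable_in_urls(name))
```

The LDH names come through unchanged however many times they are encoded or decoded, while the percent-encoded file name changes on the first extra round in either direction.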
EDIT: I threw together a working demo to get a better look at it. Everything just works AFAICT.
- Simple names such as hostname remain unchanged:
(https://karwa.github.io/swift-url-docs-test/main/documentation/weburl/weburl/hostname)
- All other names are encoded with bootstring:
(https://karwa.github.io/swift-url-docs-test/main/documentation/weburl/weburl/-serializedexcludingfragment-k6a0c21b)
- The files are all LDH. You should be able to host those from just about anywhere, and we can remove an ad-hoc workaround for leading periods (which today are prefixed with an extra ' so the OS doesn't consider them to be hidden files).
Here's a patch for DocC which will generate these file names, if you want to test it yourself: GitHub