DocC colons in filenames

amyworrall · June 8, 2022, 3:50pm

I'm trying to use DocC with transform-for-static-hosting. It's spitting out filenames with colons in them, and our version control system is warning that those will be incompatible with Windows.

Can we have an option to not use colons in filenames?

ethankusters · June 8, 2022, 7:03pm

I looked into this when we implemented this feature and my understanding at the time was that Windows handles colons gracefully by replacing them with percent-encoding. Finder does something similar on macOS I believe. If that's incorrect, we could look into having DocC emit with percent-encoding by default which might help here...

Have you run into issues with hosting or accessing these files on Windows or is it possible the version control warning is incorrect? Curious if @compnerd has any thoughts here as well. Definitely want to make the output of DocC as cross-platform compatible as possible.

compnerd · June 8, 2022, 7:36pm

: is not permitted in filenames on Windows. See: Naming Files, Paths, and Namespaces - Win32 apps | Microsoft Docs

<, >, :, ", /, \, |, ?, * are not permitted in file names. The file system performs no pattern substitution on filenames: the name requested is the name it will attempt to use, and any invalid characters will cause the operation to fail. It is possible to URL-encode the filenames, but that would a) need to be done at the DocC level and b) require percent-encoding the percent-encoding in the URL. Overall, it is going to hurt more to keep these characters than to simply restrict the filenames.

Karl · June 8, 2022, 7:48pm

It isn’t just characters - Windows has forbidden file and folder names and patterns.
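As a rough illustration of both rules, here is a minimal Swift sketch that checks a single path component against the forbidden characters listed above and against a few reserved device names. The helper `firstProblem(inFileName:)` and the `FileNameProblem` type are hypothetical names made up for this example, and the reserved-name list is deliberately non-exhaustive.

```swift
// Characters listed as forbidden in the Microsoft naming rules quoted above.
let windowsForbiddenCharacters: Set<Character> = ["<", ">", ":", "\"", "/", "\\", "|", "?", "*"]

// A non-exhaustive sample of reserved device names (assumption: matched case-insensitively,
// and also reserved when followed by an extension, e.g. "NUL.txt").
let windowsReservedNames: Set<String> = ["CON", "PRN", "AUX", "NUL", "COM1", "LPT1"]

enum FileNameProblem: Equatable {
    case forbiddenCharacter(Character)
    case reservedName(String)
}

/// Returns the first problem found in a single path component, or nil if none is found.
func firstProblem(inFileName name: String) -> FileNameProblem? {
    if let character = name.first(where: { windowsForbiddenCharacters.contains($0) }) {
        return .forbiddenCharacter(character)
    }
    // Compare only the part before the first "." against the reserved names.
    let parts = name.split(separator: ".", maxSplits: 1, omittingEmptySubsequences: false)
    let stem = parts.first.map { String($0) } ?? name
    if windowsReservedNames.contains(stem.uppercased()) {
        return .reservedName(stem)
    }
    return nil
}

print(String(describing: firstProblem(inFileName: "perform(loghandle:)")))
print(String(describing: firstProblem(inFileName: "con")))
print(String(describing: firstProblem(inFileName: "convertaction")))
```

A symbol page named `perform(loghandle:)` fails on the colon, while a perfectly ordinary-looking symbol named `con` fails on the reserved-name rule.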

Percent-encoding is a URL-level escaping mechanism that has no meaning to the file system. I’m not sure what exactly is going on here, but if DocC is using percent-encoding as an escaping mechanism on the local file system I’d advise against that; percent-encoding is horribly inefficient (every escaped byte triples in length), which can lead to hitting path length limits more easily.
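To put a number on the tripling, here is a minimal sketch that percent-encodes every byte other than ASCII letters and digits. `percentEncodeNonAlphanumerics` is a hypothetical helper written for this illustration, not something DocC ships, and the choice of which bytes to escape is an assumption.

```swift
/// Percent-encodes every UTF-8 byte that is not an ASCII letter or digit.
/// Hypothetical helper for illustration only.
func percentEncodeNonAlphanumerics(_ text: String) -> String {
    let hexDigits = Array("0123456789ABCDEF")
    var result = ""
    for byte in text.utf8 {
        let isDigit = byte >= 48 && byte <= 57
        let isUpper = byte >= 65 && byte <= 90
        let isLower = byte >= 97 && byte <= 122
        if isDigit || isUpper || isLower {
            result.append(Character(UnicodeScalar(byte)))
        } else {
            // One input byte becomes three output characters: "%" plus two hex digits.
            result.append("%")
            result.append(hexDigits[Int(byte >> 4)])
            result.append(hexDigits[Int(byte & 0x0F)])
        }
    }
    return result
}

let symbol = "createFile(atPath:contents:)"
let encoded = percentEncodeNonAlphanumerics(symbol)
print(encoded)                                // createFile%28atPath%3Acontents%3A%29
print(symbol.utf8.count, encoded.utf8.count)  // 28 36
```

Each of the four escaped bytes grows from one character to three, so the 28-byte name becomes 36 bytes. Names made mostly of operator or non-ASCII characters grow much faster.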

Base64 encoding is much more efficient; that’s why it tends to be used in data: URLs over percent-encoding.

sofiaromorales · June 8, 2022, 8:02pm

Hey Amy, AFAIK the filenames must match the symbol names from the codebase in order to render correctly, so I don't think there's an option to omit colons. Did you already try to serve these files on Windows?

compnerd · June 8, 2022, 8:17pm

@sofiaromorales, I don't see how you could even try to serve the files on Windows; you would never be able to create the file on the filesystem in the first place. As an example:

import Foundation
Foundation.FileManager.default.createFile(atPath: "invalid:file", contents: nil)

If you build and run this program, it will create a file called invalid and attach an alternate data stream (ADS) named file to it.

06/08/2022  01:14 PM                 0 invalid
                                     0 invalid:file:$DATA

If you were to use something more complex:

import Foundation
print(Foundation.FileManager.default.createFile(atPath: "createFile(atPath:contents:)", contents: nil))

It will print false as the file cannot be created.

Am I missing something and you have something else in mind for serving this from Windows where the file cannot be put on the file system?

sofiaromorales · June 8, 2022, 9:04pm

Hey @compnerd, yeah, I know that : is a forbidden character in Windows file and directory names and that it is not possible to serve files locally without having them on your disk. I was thinking that, specifically with the DocC archive, there might be some kind of character replacement happening (similar to what Finder does with :) and that the version control system is just giving a misleading warning.

compnerd · June 8, 2022, 9:13pm

As @ethankusters also thought that there would be pattern substitutions: AFAIK, Windows performs no replacement of characters and if the filename has invalid characters, in general the file is discarded; in the case of a :[\w]+ suffix, the suffix will be treated as an ADS and the file will be renamed to whatever the prefix is. That is to say, I don't think that the warning is misleading.

FWIW, this problem was encountered in swift-doc as well, and the fix there was to do character replacements (cf. perform some path sanitization over the generated names by compnerd · Pull Request #258 · SwiftDocOrg/swift-doc · GitHub).

ethankusters · June 8, 2022, 10:08pm

Sounds good. @amyworrall could you file an issue in the Swift-DocC repo when you have a chance? Thank you for bringing this up!

I think we'll want to do something similar to what @compnerd implemented in swift-doc. My main concern is doing this in a way that keeps the presentation URLs the way they are (with the reserved characters) but I think we can work through that.

amyworrall · June 9, 2022, 12:18am

https://github.com/apple/swift-docc/issues/284

tevamerlin · June 13, 2022, 4:49pm

I think we'll want to do something similar to what @compnerd implemented in swift-doc.

Character substitution when generating filenames is indeed probably the right approach.
However, it seems (if I’m not mistaken) that @compnerd has chosen to replace them all with the same character (_). I suggest not doing this, as it can easily lead to filename collisions (between symbols whose names differ only by one character, which happens to be one of the reserved chars). In fact, the use of _ may lead to collisions even with symbols which do not use any reserved character in their name, as illustrated here:

github.com/SwiftDocOrg/swift-doc

Character substitution in file names can lead to collisions

opened 07:14AM - 20 May 21 UTC

Lukas-Stuehrk

bug

When generating file names for symbols, `swift-doc` replaces some characters of …the symbol's name with underscore (_). This can lead to the problem that it produces the same file name for different symbols with different names. The second symbol then overwrites the page of the previously declared symbol. This happens rather often with operators:

```swift
infix operator >>>
public func >>> (lhs: String, rhs: String) { }
infix operator <<<
public func <<< (lhs: String, rhs: String) { }
```

Both create the file `___`. But it can also happen for any other symbol:

```swift
public class Outer {
    public struct Inner {}
}
public class Outer_Inner {}
```

Both symbols create the file `Outer_Inner`.
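The collision is easy to reproduce with a small sketch. The function `underscoreSanitized` below is a hypothetical stand-in for single-character substitution, not swift-doc's actual code, and spelling the nested type as `Outer.Inner` is an assumption made for the example.

```swift
/// Replaces every character that is not an ASCII letter or digit with "_".
/// Hypothetical stand-in for illustration; not swift-doc's implementation.
func underscoreSanitized(_ name: String) -> String {
    var result = ""
    for character in name {
        let isSafe = character.isASCII && (character.isLetter || character.isNumber)
        result.append(isSafe ? character : "_")
    }
    return result
}

let symbols = [">>>", "<<<", "Outer.Inner", "Outer_Inner"]

// Group the symbols by the file name they would produce and keep only the clashes.
let grouped = Dictionary(grouping: symbols, by: underscoreSanitized)
let collisions = grouped.filter { $0.value.count > 1 }

for (fileName, names) in collisions {
    print("\(fileName) is shared by \(names)")
}
```

Because the substitution is not reversible, any scheme of this shape needs either an escape sequence or a collision check at generation time.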

My main concern is doing this in a way that keeps the presentation URLs the way they are (with the reserved characters)

How about generating URLs that are not the same but look close enough?
All the reserved characters listed above by @compnerd have fullwidth variants in the FF00 block of Unicode. It would be possible to use those as substitutes in filenames and URLs.

< > : " / \ | ? * would map to ＜＞：＂／＼｜？＊.

Since the fullwidth characters are valid for use in identifiers, the risk of name collision is not strictly null. But I’m willing to bet their use is rare enough (likely non-existent) that we can consider the risk to be negligible, particularly if we add to that the fact that they would be used as substitutes for operator characters and not identifier characters.
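A minimal sketch of that substitution, assuming the nine reserved characters listed earlier in the thread: each fullwidth form sits exactly 0xFEE0 code points above its ASCII counterpart, so the mapping is a single addition. `fullwidthSubstituted` is a hypothetical name used only for this example.

```swift
// The reserved ASCII characters from earlier in the thread.
let reservedASCII: Set<Character> = ["<", ">", ":", "\"", "/", "\\", "|", "?", "*"]

/// Replaces each reserved ASCII character with its fullwidth form (code point + 0xFEE0).
/// Hypothetical helper for illustration only.
func fullwidthSubstituted(_ name: String) -> String {
    var result = String.UnicodeScalarView()
    for scalar in name.unicodeScalars {
        if scalar.isASCII,
           reservedASCII.contains(Character(scalar)),
           let fullwidth = Unicode.Scalar(scalar.value + 0xFEE0) {
            result.append(fullwidth)
        } else {
            result.append(scalar)
        }
    }
    return String(result)
}

print(fullwidthSubstituted("perform(loghandle:)"))

// The residual collision risk: a name that already contains U+FF1A maps to the same file name.
print(fullwidthSubstituted("a:b") == fullwidthSubstituted("a\u{FF1A}b"))
```

The last line shows the non-zero collision risk in concrete terms: the mapping is one-way, so two distinct names can still land on the same file.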

What do you think?

tevamerlin · June 13, 2022, 4:58pm

Btw, this is not just a Windows-related problem. On Linux and macOS, there are problems with the documentation for operators using / in their name.
Not the same symptoms — the files are created, but obviously without the / in their name, leading to name collisions (e.g., if you have a + operator and a /+ operator).

compnerd · June 13, 2022, 6:31pm

I believe that those characters are still valid for the names, so you still have the collision concern. Any type of attempt to limit the filename and substitute the character is going to hit this. The way to avoid that is to introduce an escaped sequence, which would increase the filename length. You would need to escape the escape specifier as well (e.g. for percent encoding you would need to escape % as %%). I would rather that we use the percent encoding over the unicode characters in the names.

Karl · June 13, 2022, 7:57pm

Do you mean %25?

Let’s say I have the string “%AB” (literal, not a percent-encoded byte):

If we escaped the % with another %, we get %%AB. Decoding this results in “%” + 0xAB - i.e. we failed to stop “%AB” being interpreted as a percent-encoded byte, and got corrupted data back. %25AB decodes to “%AB”, so preserves the original string.
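The difference can be checked with a small lenient decoder. This is a sketch under the assumption that a % not followed by two hex digits is passed through unchanged; `percentDecodedBytes` and `hexValue` are hypothetical helpers, not any library's API.

```swift
/// Returns the numeric value of an ASCII hex digit, or nil for any other byte.
func hexValue(_ byte: UInt8) -> UInt8? {
    switch byte {
    case 48...57: return byte - 48         // "0"..."9"
    case 65...70: return byte - 65 + 10    // "A"..."F"
    case 97...102: return byte - 97 + 10   // "a"..."f"
    default: return nil
    }
}

/// Lenient percent-decoding: "%XY" becomes one byte; any other "%" is kept as-is.
func percentDecodedBytes(_ text: String) -> [UInt8] {
    let bytes = Array(text.utf8)
    var output: [UInt8] = []
    var index = 0
    while index < bytes.count {
        if bytes[index] == 0x25, index + 2 < bytes.count,
           let high = hexValue(bytes[index + 1]),
           let low = hexValue(bytes[index + 2]) {
            output.append(high << 4 | low)
            index += 3
        } else {
            output.append(bytes[index])
            index += 1
        }
    }
    return output
}

print(percentDecodedBytes("%%AB"))   // [37, 171]: a literal "%" followed by the byte 0xAB
print(percentDecodedBytes("%25AB"))  // [37, 65, 66]: the bytes of "%AB"
```

Only the %25 form round-trips back to the original three characters.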

--

That being said, as I mentioned previously, I think bringing percent-encoding to the filesystem would be unwise. It already has meaning and expected behaviour in URLs (e.g. over-encoding is typically allowed, such as %41 instead of "A"), and that would not be the same here. Additionally, most URL APIs (such as JavaScript's URL class, or WebURL) are lenient about % signs and won't encode them, expecting that they are always being used for percent-encoding:

// JavaScript
var url = new URL("http://example.com/foo");
url.pathname = "%AB";
url.href;  // "http://example.com/%AB", not "http://example.com/%25AB"

That means you'll need to manually encode those "%" signs before setting the content.

I understand that, because these names can be used in URLs, it might seem reasonable to use the escaping mechanism already present in URLs, but that would be a mistake. We're not escaping URL content itself here; we're escaping the things the URL content is referencing. It's totally different, and using the same escaping mechanism for both is an easy way to introduce confusion and bugs.

--

base64 is more compact, is supported basically everywhere (it is even built-in to JavaScript), is not confusable with percent-encoding, and actually already has standardised URL-safe and filesystem-safe variants. It seems ideal for this.

It isn’t as readable, but I would argue that percent-encoding also isn’t readable, and double-encoding or custom escaping also isn’t readable. Besides, it doesn’t need to apply to every component - only ones containing characters that are not obviously filesystem-safe (e.g. plain alphanumerics are generally fine, outside of Windows reserved names like “CON”. 99% of symbols won't need to be encoded at all).
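As a sketch of what that could look like, the following uses Foundation's Base64 support and applies the URL-safe alphabet (- and _ instead of + and /, padding removed). The function names are hypothetical, the "plain alphanumerics" test is an assumption about what counts as safe, and a real scheme would also need a marker to tell encoded components from unencoded ones and a rule for reserved names like “CON”.

```swift
import Foundation

/// Encodes text with the URL- and filename-safe Base64 alphabet, without "=" padding.
func base64URLEncoded(_ text: String) -> String {
    return Data(text.utf8).base64EncodedString()
        .replacingOccurrences(of: "+", with: "-")
        .replacingOccurrences(of: "/", with: "_")
        .replacingOccurrences(of: "=", with: "")
}

/// Reverses base64URLEncoded(_:); returns nil if the input is not valid.
func base64URLDecoded(_ encoded: String) -> String? {
    var base64 = encoded
        .replacingOccurrences(of: "-", with: "+")
        .replacingOccurrences(of: "_", with: "/")
    // Restore the padding that was stripped during encoding.
    base64 += String(repeating: "=", count: (4 - base64.count % 4) % 4)
    guard let data = Data(base64Encoded: base64) else { return nil }
    return String(data: data, encoding: .utf8)
}

/// Assumption: only non-empty ASCII alphanumeric components are left unencoded.
func isPlainComponent(_ component: String) -> Bool {
    return !component.isEmpty
        && component.allSatisfy { $0.isASCII && ($0.isLetter || $0.isNumber) }
}

for component in ["convertaction", "perform(loghandle:)"] {
    let fileName = isPlainComponent(component) ? component : base64URLEncoded(component)
    print(component, "->", fileName)
}
```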

ethankusters · June 13, 2022, 8:05pm

@Karl I can see how that approach could work for the JSON Swift-DocC emits and for dynamic hosting scenarios. I'm not yet convinced that losing readability is a worthy trade-off here but it definitely seems worth investigating this approach.

My main question is how would this strategy work for the directories Swift-DocC creates for the index.html files in static hosting scenarios?

For example: https://github.com/apple/swift-docc/tree/gh-pages/docs/documentation/swiftdoccutilities/convertaction/perform(loghandle:).

Here the filesystem representation of the paths is 1-1 with the presentation URLs served by the browser.

It doesn't seem like a basic file-hosting server will "just work" with base64-encoded paths but maybe I'm missing something here.

Karl · June 13, 2022, 8:18pm

There is going to be a readability trade-off either way - a file named perform(loghandle:) won't work.

With percent-encoding, you get a longer URL with %WX%XY%YZ sequences everywhere. With base64, it's all jargon, but it's much shorter (usually; it depends on how much needs to be escaped). Neither of them is particularly readable, IMO, so I'm not sure it's worth prioritising readability. I would be more concerned about length (some systems have very strict path length limits, and Swift symbols can be rather long), and ensuring developers can easily distinguish an escaped URL path component from an unescaped path component containing the filesystem-safe escaping of a symbol.

EDIT:

Actually, perhaps some kind of bootstring encoding (à la Punycode) would be even better than base64.

It keeps non-escaped characters in the string intact (e.g. München -> Mnchen-3ya), so it's almost readable (certainly a lot better than percent-encoding) and is a more obvious length improvement when only a few characters need to be escaped.
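For reference, here is a compact sketch of the standard Punycode encoder from RFC 3492, which reproduces the München example. It is plain Punycode, so it only moves non-ASCII code points out of the string; a filename scheme would additionally have to treat characters such as : and ( as non-basic. The name `punycodeEncode` is hypothetical and the sketch omits the overflow checks the RFC calls for.

```swift
/// Standard RFC 3492 Punycode encoding (no overflow checks; illustration only).
func punycodeEncode(_ input: String) -> String {
    let base = 36, tMin = 1, tMax = 26, skew = 38, damp = 700
    let scalars = input.unicodeScalars.map { Int($0.value) }
    var n = 128, delta = 0, bias = 72

    // Basic (ASCII) code points are copied through unchanged, followed by a delimiter.
    var output: [Character] = scalars.filter { $0 < 128 }.map { Character(UnicodeScalar(UInt8($0))) }
    let basicCount = output.count
    var handled = basicCount
    if basicCount > 0 { output.append("-") }

    // Digit values 0...25 map to a...z, 26...35 map to 0...9.
    func digit(_ value: Int) -> Character {
        return Character(UnicodeScalar(UInt8(value < 26 ? value + 97 : value - 26 + 48)))
    }
    // Bias adaptation as specified in RFC 3492, section 6.1.
    func adapt(_ rawDelta: Int, _ numPoints: Int, _ isFirst: Bool) -> Int {
        var d = isFirst ? rawDelta / damp : rawDelta / 2
        d += d / numPoints
        var k = 0
        while d > ((base - tMin) * tMax) / 2 {
            d /= base - tMin
            k += base
        }
        return k + (base - tMin + 1) * d / (d + skew)
    }

    while handled < scalars.count {
        // Next-smallest code point that has not been encoded yet.
        let m = scalars.filter { $0 >= n }.min()!
        delta += (m - n) * (handled + 1)
        n = m
        for c in scalars {
            if c < n { delta += 1 }
            if c == n {
                // Emit delta as a variable-length integer.
                var q = delta
                var k = base
                while true {
                    let t = k <= bias ? tMin : (k >= bias + tMax ? tMax : k - bias)
                    if q < t { break }
                    output.append(digit(t + (q - t) % (base - t)))
                    q = (q - t) / (base - t)
                    k += base
                }
                output.append(digit(q))
                bias = adapt(delta, handled + 1, handled == basicCount)
                delta = 0
                handled += 1
            }
        }
        delta += 1
        n += 1
    }
    return String(output)
}

print(punycodeEncode("München"))  // Mnchen-3ya
```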

taylorswift · June 13, 2022, 8:40pm

to provide an extra data point, the swiftinit docs do not escape colons (in their canonical urls), but do escape parentheses: ( -> %28 and ) -> %29.

example:

swiftinit.org/reference/swift-nio/niocore/bytebufferview.suffix%28_:%29

the rationale for this is that browsers like firefox will not display a percent-encoded colon as a colon in the navigation bar, so it is better to avoid percent-encoding it.

one of the benefits of dynamic hosting is that you can solve these kinds of issues with the url parser. both of the following are supported url spellings:

swiftinit.org/reference/swift-nio/niocore/bytebufferview.suffix(_:)

swiftinit.org/reference/swift-nio/niocore/bytebufferview.suffix%28_%3A%29
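One way a dynamic host could accept several spellings is to percent-decode the incoming path before looking the symbol up. The sketch below only illustrates that idea with Foundation's `removingPercentEncoding`; `canonicalSymbolPath` is a hypothetical name and this is not a description of how swiftinit actually implements it.

```swift
import Foundation

/// Normalizes a request path by removing percent-encoding, so that differently
/// escaped spellings of the same symbol compare equal. Returns nil for invalid escapes.
func canonicalSymbolPath(_ path: String) -> String? {
    return path.removingPercentEncoding
}

let spellings = [
    "bytebufferview.suffix%28_:%29",
    "bytebufferview.suffix(_:)",
    "bytebufferview.suffix%28_%3A%29",
]

for spelling in spellings {
    print(spelling, "->", canonicalSymbolPath(spelling) ?? "invalid")
}
```

All three spellings resolve to `bytebufferview.suffix(_:)`, which a static file host cannot do because it maps paths to files without a parsing step in between.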

Karl · June 14, 2022, 3:34am

Quick proof of concept, modifying WebURL's Punycode implementation to encode all characters except LDH (letters/digits/hyphen). It's a single-file, self-contained demo.

printEncoded("somesymbol123")
Original: somesymbol123
Encoded:  somesymbol123
Decoded:  somesymbol123

printEncoded("bytebufferview.suffix(_:)")
Original: bytebufferview.suffix(_:)
Encoded:  xn--bytebufferviewsuffix-uyaw1iwn26a
Decoded:  bytebufferview.suffix(_:)

printEncoded("perform(loghandle:)")
Original: perform(loghandle:)
Encoded:  xn--performloghandle-wta1a63a
Decoded:  perform(loghandle:)

printEncoded("test-[]!{}@<>|£:}{!@^&£)$&")
Original: test-[]!{}@<>|£:}{!@^&£)$&
Encoded:  xn--test--2faayrb5a5rza2a2ae66bfb0a78ceutf26gha
Decoded:  test-[]!{}@<>|£:}{!@^&£)$&

There are parameters which could be tweaked to give an even more compact result, and we could replace the xn-- prefix with something else. It's only a quick proof of concept.

But this kind of encoding has a lot of advantages. It is generally going to be a lot more compact than percent-encoding (especially when unicode characters are involved), is actually quite readable IMO, and since it is limited to LDH, it is already URL and filesystem-safe. It will never even need percent-encoding, meaning we get to avoid all of that complexity and the ambiguities that come with double-encoding.

It should also be much less susceptible to embedding issues. For example, Xcode considers "scheme://x/abc**cd" as starting bold text due to the unescaped **. Because this encoding escapes all non-LDH characters, we should be able to avoid conflicts with delimiters in most documents.

wowbagger · June 14, 2022, 5:22am

What if we just use mangled names?

jack · June 14, 2022, 5:29am

We have encountered mangled names and USRs that are many megabytes in length (while this is an edge case, kilobyte+ names are not exceedingly rare); I would like to avoid encoding mangled names into file names.

We don't have a web URL problem, we have a filesystem path problem. We need a way to reliably take an existing web URL and turn it into a file path that any OS can handle. That said, is any url transformation even possible for a static web server? I'm not terribly familiar with that part of the stack. We can encode whatever rules we'd like into our single-page app, but static hosts aren't aware of those rules.