Unsafe characters in file names redux

I have been working on getting DocC support for Windows. Overall, we are fairly close (~94% pass rate on the test suite). However, one item blocking progress is a topic that was previously discussed - the use of colons in file names.

File systems differ in which characters they support. On Windows, the following character set is not acceptable in file names: [<>:"/\|?*] ([ and ] being the delimiters of the set). Beyond the unsafe characters, Windows also reserves the following identifiers:

  • CON
  • PRN
  • AUX
  • NUL
  • COM[1-9]
  • LPT[1-9]

As Windows generally uses a case-insensitive file system, all possible variations of upper- and lower-case characters spelling those names are reserved.
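To make the constraint concrete, here is a minimal sketch of what such a name check might look like. The helper names are hypothetical and this is not DocC code; it just encodes the character set and reserved identifiers listed above:

```swift
import Foundation

// Hypothetical helpers, not part of DocC - they just encode the rules above.
let windowsUnsafeCharacters = CharacterSet(charactersIn: "<>:\"/\\|?*")

let windowsReservedNames: Set<String> = {
    var names: Set<String> = ["CON", "PRN", "AUX", "NUL"]
    for i in 1...9 {
        names.insert("COM\(i)")
        names.insert("LPT\(i)")
    }
    return names
}()

func isWindowsSafeFileName(_ name: String) -> Bool {
    // Reject any unsafe character anywhere in the name.
    if name.rangeOfCharacter(from: windowsUnsafeCharacters) != nil {
        return false
    }
    // Reserved device names match case-insensitively, and traditionally
    // even with an extension ("con.html" is still reserved).
    let stem = name.split(separator: ".").first.map(String.init) ?? name
    return !windowsReservedNames.contains(stem.uppercased())
}

// isWindowsSafeFileName("min(_:_:).json")     // false: contains ':'
// isWindowsSafeFileName("con.html")           // false: reserved name
// isWindowsSafeFileName("min(_%3A_%3A).json") // true
```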

The overall argument is that : provides a nicer URL; however, it prevents the creation of the corresponding files on disk on Windows. Altering the naming convention for the files would resolve this, though it would mean that we cannot use : (amongst the other characters above) in the file name.

IMO having a single format for the name is somewhat important to ensure that we are properly testing the software and that we can build documentation that can be easily hosted and moved across platforms, as well as generated on any platform. Having options to change the behaviour would result in certain builds being less tested. Having a single path through the various libraries and tools has been beneficial on other projects, providing both better testing and shared bug fixes.

Hopefully we can come to a consensus on how to handle this soon so that we can enable DocC on Windows as well.

CC: @ronnqvist

9 Likes

To me the readability of the URLs is an important goal—and I have some unrelated ideas for how to improve that further—but the bigger issue with removing characters that are used in URLs today is that it breaks existing URLs that people and other systems may be relying on.

These characters appear both in the web URL and in the page identifiers in the DocC "linkable entities" file which is part of its interface to bridge with other systems.

Even if we come to a consensus for a single format, I think we'd want to offer a full release cycle for migration before we make it the new default format, and we may also want to support the current format for some time after that.

Considering how much work it can be to do a large-scale link/content migration like that—I've already deferred one such attempt, to remove the language info from symbol-kind disambiguation in the web URLs—and considering that we'd likely have two formats for an extended period of time regardless, I'm not fully convinced that the benefits of a single format would be worth the extra code and testing effort.

3 Likes

I'd be interested in exploring whether the web URL really needs to be tied to the on-disk naming of the Render JSON and index.html file directories. Maybe we can do some work in DocC-Render to decouple this and allow the file URL to be percent-encoded (or similar) while keeping the clean web URLs?

CC: @dhristov @marcus_ortiz

1 Like

Wouldn’t that further tie DocC to client-side JavaScript rendering? I would prefer if DocC could eventually produce plain, pre-rendered, JavaScript-free .html pages that could be served directly from disk, which would not be able to rely on such a mapping scheme.

4 Likes

DocC is already heavily dependent on vue.js. i am also not too enthusiastic about the “serve a gigantic index file and then render everything on the client side” architecture, but it seems to be too entrenched to change course today.

I don’t think it’s too late, but it’s definitely an alternative pathway. Which is why I refined my post to talk strictly about whether such a solution would preclude pursuing that pathway, and that I’m not trying to hijack this discussion into a complete rewrite.

the problem with static HTML files, aside from the file path characters issue, is that they do not scale. they use too much disk space, and the problem compounds when you want to host many versions of documentation for many packages.

now, DocC also has similar storage consumption problems, which is why if you compare it as-is with the HTML archive idea, emitting pre-rendered HTML files does not seem like such a bad alternative. but that architecture has inherent resource usage constraints, whereas the problems with DocC in my view are a matter of poor implementation, and could in theory be mitigated.

We should take further discussion of the pros and cons of pre-rendered HTML elsewhere, as long as we can agree that this isn’t something we should foreclose upon by committing simultaneously to 1) never changing the URL scheme and 2) using client-side logic to map URL components to server-side path names.

i’m not too familiar with Windows, but could we sidestep the issue by using percent encoding (%3A) to encode colons?

Not really. That only affects how the colon is transmitted between the user-agent and the webserver; it doesn’t affect the fact that the webserver will not be able to ask the filesystem for the contents of a file with a ":" in the name.

Though @compnerd might know if UNC paths (ones that begin with \\?\) can refer to reserved or illegal filenames.

Surely Windows webserver vendors, Cygwin developers, and the WSL team have already found some solution to this problem.

can’t the percent-encoding be used for the file name itself? (if i remember right, % is a legal Windows path character.)

then whatever middleware is being used to serve the files can just be configured to always canonicalize : to %3A.
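to sketch what i mean, here is one way such an encoding policy could look (the function name is hypothetical, and the exact allowed set would be a DocC decision):

```swift
import Foundation

// Hypothetical sketch: percent-encode only the Windows-unsafe characters
// in a path component, plus '%' itself so decoding stays unambiguous.
func onDiskName(for component: String) -> String {
    var allowed = CharacterSet.urlPathAllowed
    allowed.remove(charactersIn: "<>:\"/\\|?*%")
    return component.addingPercentEncoding(withAllowedCharacters: allowed) ?? component
}

// onDiskName(for: "min(_:_:)")  // "min(_%3A_%3A)"
```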

Even if you use \\?\ paths to bypass the other usual Win32-level restrictions, : is still problematic since it's NTFS's "alternate data stream" delimiter to indicate a separate "fork" in a file (a legacy of classic Mac OS network interop, ironically enough). And while you can create a reserved-name file like \\?\C:\con\con through UNC paths, a lot of other Windows apps will break in funny ways if they try to process that file, so it's not a great idea.

4 Likes

That would invalidate the existing URL scheme, at which point why not just pick one that avoids the reserved Windows characters/names?

The main data stream is accessible as the :$DATA named stream, so if you were writing a maximally-compatible webserver you could canonicalize all incoming paths to use :$DATA. Though I highly doubt anyone has done so. :slight_smile:

how would it invalidate the existing URL scheme?

if the HTTP layer receives a request for Swift.min(_%3A_%3A), it can simply pass that as-is to the file system.

if the HTTP layer receives a request for Swift.min(_:_:), it canonicalizes it to Swift.min(_%3A_%3A), and then passes that path to the file system.
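in code, that canonicalization step might look roughly like this (reusing the hypothetical onDiskName(for:) from the earlier sketch; a real server would also need to reject %2F before decoding so encoded slashes cannot alter the path structure):

```swift
import Foundation

func canonicalizeRequestPath(_ rawPath: String) -> String {
    // Decode once so "%3A" and a literal ":" collapse to the same form.
    let decoded = rawPath.removingPercentEncoding ?? rawPath
    // Re-encode each component with the same policy used when writing
    // the files, so both spellings resolve to one on-disk name.
    return decoded
        .split(separator: "/", omittingEmptySubsequences: false)
        .map { onDiskName(for: String($0)) }
        .joined(separator: "/")
}

// canonicalizeRequestPath("/swift/min(_:_:)")     // "/swift/min(_%3A_%3A)"
// canonicalizeRequestPath("/swift/min(_%3A_%3A)") // "/swift/min(_%3A_%3A)"
```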

Fair enough, but if you’re going to change the naming scheme of files on disk (and break the symmetry between URLs and the filesystem), why not just change it to a simpler one, e.g. one that replaces : with #?

it could, but i’m going off the assumption that most HTTP-to-file-system servers already have a way to configure percent-encoding policies, whereas a custom escape scheme like : → # requires a custom-built server.

This seems like a rather interesting idea to me. This would, as far as I am concerned, permit us to have uniform behaviour across the various environments.

Files on disk cannot encode the invalid characters, so the request would fail to find the file, I believe.

No, as @Joe_Groff correctly pointed out, neither NT-style paths (\\?\ prefixed) nor UNC paths (\\unc\ prefixed) can name such files.

It should not be too hard to have a set of "forbidden" URL characters that are transformed/encoded just before sending a request.

It would mean a breaking change to routing, though, as files that have those forbidden characters in their names without escaping would not be found.

Maybe, for a transitional period, we could also create HTML files with the old file names where the platform allows it (or behind an option), perhaps containing just a redirect, if that works?

Such additional files with the old file names could also display a warning.
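For illustration, a transitional stub generator might look like this. This is only a sketch under the assumptions above; the function and parameter names are hypothetical, and on Windows writing a name containing : will simply fail, which is exactly the case where the stub cannot (and need not) exist:

```swift
import Foundation

// Hypothetical sketch: write an HTML stub at the old (unescaped) name that
// redirects to the new (escaped) name, on platforms that can create it.
func writeRedirectStub(oldName: String, newName: String, in directory: URL) throws {
    let html = """
    <!DOCTYPE html>
    <meta http-equiv="refresh" content="0; url=\(newName)">
    <link rel="canonical" href="\(newName)">
    <p>This page has moved to <a href="\(newName)">\(newName)</a>.</p>
    """
    let stubURL = directory.appendingPathComponent(oldName)
    try html.write(to: stubURL, atomically: true, encoding: .utf8)
}
```

A stub like this could also carry the warning mentioned above.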