Next steps: File URLs

Karl · June 1, 2021, 11:43pm

Hi

The next feature I'm working on for WebURL is file URL <-> path conversion. This thread is to let you know what's happening, what the challenges are, and to discuss if there are any issues with the approach. I would be particularly grateful for any feedback on how I intend to handle Windows paths, as it's the trickiest platform and the one I have the least experience with.

The goal is that you can create a file URL from a local path, give it to an application which supports both local and remote resources, and that application will be able to reverse the transformation to end up at exactly the same local path.

The most challenging aspect of this is Windows support, as it has 3 different kinds of paths (DOS paths, UNC paths, and DOS device paths), each of which has sub-types with their own sets of challenges. I'm basing the implementation on Chromium's utilities, because there is a general consensus among participants in the URL Standard to align with Chrome when it comes to handling files on Windows. However, Chrome's implementation is not extremely well tested, so some aspects of its behaviour are not very well defined, and other aspects make it impractical for direct exposure to developers as API (e.g. you can create file URLs with hostnames on any platform, but those URLs will fail to round-trip back to paths on any platform but Windows). So some adjustments will have to be made and test coverage will have to be expanded.

I'd like for WebURL to incorporate modern best practices for security and reliability, and that means some kinds of paths won't be supported, sometimes depending on the platform:

Relative paths will not be supported.

URLs cannot express relative paths - for some URL file:///foo, there is no way to say that foo should be resolved relative to the working directory of the application processing the URL. This means that when creating a file URL from a path, we would need to resolve it into an absolute path.

Automatically resolving such paths has the potential to leak internal server/application details - for instance, if some part of the path contains data derived from user input, it may be possible to escape the intended filesystem subtree so that the resolved path points to an internal configuration file. This is known as a directory traversal attack.

In order to help mitigate these kinds of errors, it will not be possible to create a file URL from a relative path. No automatic resolution will be performed. Similarly, when creating a path from a file URL, if the result would be interpreted by the operating system as a relative path, the function will fail and throw an error instead.

If developers need to create a file URL from a relative path, they can use libraries such as swift-system or utilities provided by the operating system, such as realpath, to resolve the path relative to whichever base directory is appropriate. Those libraries can also provide utilities to check that the resulting path does not escape the intended subtree. I feel that there is value in keeping this as an explicit, separate step for applications which deal in relative paths.
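To illustrate the kind of explicit check an application could perform before creating a file URL, here is a minimal sketch of a purely lexical resolver. The function name `resolve(_:inside:)` is hypothetical and is not part of swift-system or WebURL. It also does not follow symlinks the way realpath does, so treat it as an outline of the idea only:

```swift
/// Lexically resolves `relative` against the absolute POSIX directory `base`.
/// Returns nil if `base` is not absolute, if `relative` is absolute,
/// or if the result would climb above `base`.
/// Hypothetical helper for illustration; it does not touch the filesystem.
func resolve(_ relative: String, inside base: String) -> String? {
  guard base.hasPrefix("/"), !relative.hasPrefix("/") else { return nil }
  var components = base.split(separator: "/").map { String($0) }
  let baseDepth = components.count
  for part in relative.split(separator: "/") {
    switch part {
    case ".":
      continue
    case "..":
      // Popping a component that belongs to `base` would escape the subtree.
      guard components.count > baseDepth else { return nil }
      components.removeLast()
    default:
      components.append(String(part))
    }
  }
  return "/" + components.joined(separator: "/")
}

print(resolve("templates/index.html", inside: "/home/myservice") ?? "rejected")
print(resolve("templates/../../../etc/passwd", inside: "/home/myservice") ?? "rejected")
```

The second call is rejected because the `..` components would climb out of `/home/myservice`. Only a path that survives a check like this would then be handed to the file URL initializer.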

On POSIX, this is relatively straightforward. For path -> URL conversion, we reject paths which do not begin with a "/". For the reverse transformation (URL -> path), we don't need to do anything, since file URLs are assumed to have absolute paths and always start with a "/".

On Windows, things are more complicated. Fortunately, of the 3 kinds of paths, only DOS-style paths can be relative. Unfortunately, DOS-style paths themselves support 3 kinds of 'relativeness':

| Example | Relative to |
| --- | --- |
| `foo\bar` | current directory, current volume |
| `\foo\bar` | current volume |
| `D:foo\bar` | current directory on given volume |

Note that even though the last example includes a volume, the path after it does not begin with a slash (`\` or `/`), so it is considered relative to the current directory of the D: volume (every volume can have a different "current directory" on Windows). Leaving out that slash is a common source of bugs.
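The three kinds of relativeness can be told apart by looking at the first few characters of the path. The sketch below is only an illustration of that idea; the `DOSPathKind` enum and the `classifyDOSPath` function are made-up names, not WebURL API, and paths starting with two separators are simply reported as "not a DOS path":

```swift
enum DOSPathKind: Equatable {
  case absolute        // C:\foo\bar
  case relative        // foo\bar   (current directory, current volume)
  case rooted          // \foo\bar  (current volume)
  case driveRelative   // D:foo\bar (current directory on given volume)
  case notDOS          // starts with two separators: UNC or DOS device path
}

func isSeparator(_ c: Character) -> Bool {
  return c == "\\" || c == "/"
}

/// Hypothetical classifier based only on the leading characters of `path`.
func classifyDOSPath(_ path: String) -> DOSPathKind {
  let chars = Array(path)
  if chars.count >= 2, isSeparator(chars[0]), isSeparator(chars[1]) {
    return .notDOS
  }
  if chars.count >= 2, chars[0].isASCII, chars[0].isLetter, chars[1] == ":" {
    // A drive letter only makes the path absolute if a separator follows it.
    if chars.count >= 3, isSeparator(chars[2]) { return .absolute }
    return .driveRelative
  }
  if let first = chars.first, isSeparator(first) {
    return .rooted
  }
  return .relative
}

for path in [#"foo\bar"#, #"\foo\bar"#, #"D:foo\bar"#, #"D:\foo\bar"#] {
  print(path, "->", classifyDOSPath(path))
}
```

Under the approach described here, only the `.absolute` case would be accepted when converting a DOS-style path to a file URL.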

I intend that creating a file URL from any of these 3 paths, or creating any of these paths from a file URL, would fail on Windows. This means that, for instance, the URL file:///foo/bar can be turned in to a path on POSIX systems, but not on Windows systems.

Creating a POSIX path from a file URL with a hostname will not be supported.

There is no obvious interpretation of a file host on POSIX systems, where remote filesystems are typically mapped to the local filesystem. We could just ignore the host, but RFC-8089 discourages this and Chrome stopped doing it because it could lead to spoofing issues (e.g. file://accounts.google.com/some/local/file and file:///some/local/file would display the same content). File hosts will be supported on Windows, where they will result in UNC paths.

Creating a path from a URL containing percent-encoded directory separators will not be supported.

For instance, file:///some/folder/..%2F..%2F..%2F..%2F../etc/passwd is a valid URL. However, when we create a file path from a URL, we need to decode all percent-encoding (percent-encoding has no meaning in FS paths). This would result in the path /some/folder/../../../../../etc/passwd which again, can leave users open to directory traversal attacks. We're following Chrome here, which determined that this was a security risk.

Note that percent-encoded backslashes (%5C) are allowed on POSIX systems, as the OS does not consider them to be path separators. Windows accepts both forward slashes and backslashes as separators, so we need to block both %2F and %5C on Windows.
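A check along these lines could be a simple scan of the URL's path before any percent-decoding happens. This is a sketch under assumptions: the function name and its `windows` parameter are hypothetical, and the real implementation may look quite different:

```swift
/// Returns true if the (still percent-encoded) URL path contains an encoded
/// directory separator: %2F everywhere, and additionally %5C when the target
/// path style is Windows. Hex digits are matched case-insensitively.
/// Hypothetical helper for illustration only.
func hasEncodedSeparator(inURLPath path: String, windows: Bool) -> Bool {
  let bytes = Array(path.utf8)
  var i = 0
  while i + 2 < bytes.count {
    if bytes[i] == UInt8(ascii: "%") {
      let high = bytes[i + 1]
      let low = bytes[i + 2] | 0x20   // fold ASCII "F"/"C" to lowercase
      if high == UInt8(ascii: "2") && low == UInt8(ascii: "f") { return true }
      if windows && high == UInt8(ascii: "5") && low == UInt8(ascii: "c") { return true }
    }
    i += 1
  }
  return false
}

let suspicious = "/some/folder/..%2F..%2F..%2F..%2F../etc/passwd"
print(hasEncodedSeparator(inURLPath: suspicious, windows: false))
```

The `windows` flag captures the platform difference: an encoded backslash is only treated as a separator when the result is going to be a Windows path.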

It's worth noting that this behaviour is more restrictive than Foundation's URL type, which automatically resolves relative paths using the current process' working directory and allows percent-encoded directory separators. For the reasons given above, I consider both of these to be security risks.

Examples of Foundation doing things I think it probably shouldn't be doing

// Example: Foundation automatically resolves relative paths.
import Foundation

// CWD = /home/myservice/
let templateName = "../../../etc/passwd"
URL(fileURLWithPath: "templates/\(templateName)").absoluteURL // file:///etc/passwd

// Example: Foundation decodes pct-encoded separators
import Foundation

let url = URL(string: "file:///some/folder/..%2F..%2F..%2F..%2F../etc/passwd")!
print(url) // file:///some/folder/..%2F..%2F..%2F..%2F../etc/passwd
url.withUnsafeFileSystemRepresentation {
  print(String(cString: $0!)) // /some/folder/../../../../../etc/passwd
}

(Note that in the latter case, .standardizedFileURL will decode and simplify the URL to file:///etc/passwd before it gets to the OS, but the regular .standardized will not. So there is technically a way around it, but you need to know precisely what you need and go looking for it).

My hope is that this strikes a pragmatic balance, allowing as many file paths as we can reasonably and reliably support, while learning from some of the security risks that browsers have encountered and improving on the APIs that Swift developers currently have at their disposal. Each of these failure conditions will throw descriptive errors, so if they occur in your application, you'll have some guidance on what went wrong and how you can fix it.

Currently, implementation is blocked by some changes I'd like to make to the URL Standard in order to better support UNC hostnames, but I may release a POSIX-only update if it takes too long to make progress on that issue.

So that's it. What do you think - are these safeguards too restrictive? Have you ever been affected by a directory traversal attack? Or a bug stemming from Foundation's automatic relative path resolution? Do you know of any other edge-cases to consider? It's all useful signal, so I'd appreciate the feedback.

Thanks for reading!

Karl · June 2, 2021, 12:08am

Some interesting links for the curious:

MSDN via archive.org: The bizarre and unhappy story of file URLs

No one had been quite as abused as the the little file: URL. This URL was special because we had always used files and DOS paths (and no one at the time knew about path canonicalization attacks), everyone was quite sure what they looked like , acted like, and even tasted like. It didn't help that the file: protocol remained in RFC limbo as a platform/OS specific protocol. So the browser and the browser's little friends would take turns dressing a DOS path like an URL in a pink bunny suit and undressing the URL with a pair of rusty scissors, pretending it was the same DOS path they started with. Only the simplest of URLs was able to withstand this abuse, and it soon became clear that something would have to be done, lest the little file: URLs go off on their own and be lost forever.
Microsoft: CreateURLMoniker considered harmful

CreateURLMoniker does a couple of horrible things to file URIs that if misused can lead to security bugs.
Microsoft: File URIs in Windows

Invalid file URIs are among the most common illegal URIs that we were forced to accommodate in IE7. [...] The standard for the file scheme doesn’t give specific instructions on how to convert a file system path for a specific operating system into a file URI. While the standard defines the syntax of the file scheme, it leaves the conversion from file system path to file URI up to the implementers.

Since there’s no way to include an IPv6 address in a Windows file path, there’s no corresponding file URI and so there’s no way to incorporate an IPv6 address in file URIs in Windows.
Wikipedia: File URI Scheme
RFC-8089: The "file" URI Scheme (really not as helpful as you'd hope it would be)

[EDIT: And another one. cvedetails.com actually has a category for directory traversal attacks, they are that common]

CVEDetails: Directory Traversal Attacks

Some of these are great (terrible, but great)

IBM Host firmware for LC-class Systems could allow a remote attacker to traverse directories on the system. An attacker could send a specially-crafted URL request that would allow them to delete arbitrary files on the system.

In Django 2.2 before 2.2.21, 3.1 before 3.1.9, and 3.2 before 3.2.1, MultiPartParser, UploadedFile, and FieldFile allowed directory traversal via uploaded files with suitably crafted file names.

A path handling issue was addressed with improved validation. This issue is fixed in macOS Big Sur 11.0.1. A remote attacker may be able to modify the file system. [-- yes, it's macOS! I can't find many details, but it's CVE-2020-27896]

compnerd · June 2, 2021, 4:27pm

Unless you are categorizing things differently, this isn't entirely correct. There are additional path types

DOS Path: C:\Users\compnerd\Desktop
UNC Path: \\hostname\C$\Users\compnerd\Desktop
DOS Device Path: \\?\Volume{GUID}\Users\compnerd\Desktop
NT Object Path: \??\C:\Users\compnerd\Desktop, \??\UNC\hostname\C$\Users\compnerd\Desktop

Of course you have the various forms in each, but there is at least a fourth category of paths that you are overlooking. Those all do reference the same location assuming the hostname and GUID are properly replaced.

Out of curiosity, do you intend to support HFS paths? (Not HFS+, HFS paths of the form Macintosh HD:Users:compnerd:Desktop).

File paths are miserable.

Karl · June 4, 2021, 2:26pm

Hmm, so I've been going by the documentation I linked to, which doesn't mention NT object paths. Thanks for letting me know about them!

Given that NT paths begin with a \, without any special handling they would be rejected for looking like a relative path. After spending a bit of time reading up on them and doing some experiments in a Windows VM, I think that is probably the correct behaviour (although I'll definitely be adding explicit tests for this style of path), for a couple of reasons:

Windows itself doesn't always handle these paths correctly.

As detailed by the Project Zero blog, calling GetFullPathName with an NT object path will treat it like a "rooted" (i.e. drive-relative) path. I tested this in the VM, to see if anything had changed since 2016 when that post was written, but it appears that it hasn't:

Example (C++)
```
#include <iostream>
#include <windows.h>
#pragma comment(lib, "Kernel32.lib")

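void testPath(LPCSTR path) {
    char out[MAX_PATH + 1];
    // Returns 0 on failure, or the required buffer size if `out` is too small;
    // on success the result is already null-terminated.
    DWORD len = GetFullPathNameA(path, sizeof(out), out, NULL);
    if (len == 0 || len >= sizeof(out)) {
        std::cout << "GetFullPathNameA failed for: " << path << std::endl;
        return;
    }
    std::cout << "PATH:      " << path << std::endl;
    std::cout << "FULL PATH: " << out << std::endl;
}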

int main() {
    testPath("\\??\\D:\\foo");
}
```
Output:
```
PATH:      \??\D:\foo
FULL PATH: c:\??\D:\foo
```
We don't have the source code for Windows, of course, but we can take a peek at the source code for .NET, which uses similarly basic logic (i.e. considering all paths which start with a leading slash to be "rooted", drive-relative paths). There is also some code which suggests that \??\ is equivalent to \\?\. But quite often, I see Microsoft's own APIs treating these like relative paths, so it seems quite reasonable to do the same and ultimately reject them, as we do for other kinds of relative paths. I'm open to suggestions, though, if you think this is important.

Neither Edge nor Internet Explorer support NT object paths in file URLs

The Windows API UrlCreateFromPathA encodes the ?? component directly in the path, as you might expect: \??\X:\ABC -> file:///%3F%3F/X:/ABC, and PathCreateFromUrlA can successfully recreate the path from such a URL. That said, I tried in both Edge and Internet Explorer, and neither would open this URL. So there don't seem to be any legacy/compatibility concerns, AFAICT.

FWIW, I also tried encoding the ?? component as a hostname, but that seemed to be even worse: PathCreateFromUrlA turned the URL file://%3F%3F/C:/Windows in to the clearly wrong UNC path \\??C:\Windows. Again, neither Edge nor Internet Explorer knew what to do with it.

It's not clear that it even makes sense to make file: URLs from these paths

These paths represent very low-level objects, used in very specific contexts and only supported by specific APIs. I'm not sure it really makes sense to turn them in to URLs in the first place (or at least not file: URLs - perhaps a custom scheme?). It may actually be a good thing for that operation to fail; to serve as a signal to developers that they are probably not creating the URL they expected, and that they should use the platform APIs first to resolve them to a DOS/UNC-style path.

This raises an interesting point about the value in limiting the set of inputs which your APIs support. Even if we don't support every single kind of path that could possibly exist on Windows, I don't think that should necessarily be considered a flaw. It makes a good deal of sense to limit the inputs to paths which we (or I) can reasonably understand, test, and maintain, and to do our absolute best to reject every other kind of input. We'd get better guarantees if paths were strongly typed, but being strings, the only option is to have very specific predicates for what is allowed and to reject anything we don't recognise.

One of the reasons URLs are in such a sorry state is that implementors were often too lenient (or just didn't care) about supporting more than was absolutely required. This means that we now have decades worth of obscure legacy features which make URLs harder and slower to parse or reason about, and which hardly anybody actually uses, but which can't actually be removed because of compatibility concerns.

One example: IPv4 addresses. RFC-1738 (the 1994 standard which Foundation's URL conforms to) described a URL's "host" as being:

The fully qualified domain name of a network host, or its IP address as a set of four decimal digit groups separated by "." [emphasis added]

Lots of implementations weren't very strict about that, and deferred to libc's inet_aton to parse/detect IPv4 addresses -- it does, after all, support the "dotted decimal" notation, but it also supports basically every other way you could think of to write a 32-bit integer. 0xbadf00d is a valid IPv4 address according to inet_aton. When the URL specification was updated by RFC-3986 16 years ago, this was specifically called out as not being within spec and a potential security concern. But as things stand, we're stuck with it for compatibility.

Every time we parse a URL, we need to check its hostname to see if it's an IPv4 address (v6 addresses are much simpler, since they are enclosed in [ ] brackets), so everybody is paying for this parser which is more complex than anybody ever intended it to be. FWIW, the WHATWG URL Standard rewrites these IP addresses, so if you parse http://0xbadf00d/ with WebURL, you'll get http://11.173.240.13/ back.
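The rewrite in that last example is just the 32-bit value split into four bytes. A small sketch of the arithmetic (the `dottedDecimal` helper is a hypothetical name for illustration, not WebURL API):

```swift
/// Formats a 32-bit IPv4 address in dotted-decimal notation,
/// most significant byte first.
func dottedDecimal(_ address: UInt32) -> String {
  let octets: [UInt32] = [
    address >> 24,
    (address >> 16) & 0xFF,
    (address >> 8) & 0xFF,
    address & 0xFF,
  ]
  return octets.map { String($0) }.joined(separator: ".")
}

// 0x0B = 11, 0xAD = 173, 0xF0 = 240, 0x0D = 13
print(dottedDecimal(0xBADF00D)) // 11.173.240.13
```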

Karl · June 4, 2021, 2:36pm

I don't have any plans to add that support (AFAICT, even Foundation has deprecated support for them, and the only way to do it now is via CFURL using knowledge of enum raw values). It wouldn't be difficult to support if anybody was interested.

The API I have planned looks (very roughly) like this:

enum FilePathStyle {
  case windows
  case posix

  static var native: FilePathStyle {
    #if os(Windows)
      return .windows
    #else
      return .posix
    #endif
  }
}

extension WebURL {

  public init(fileURLFromPath: String, style: FilePathStyle = .native) throws

  func toFilePath(style: FilePathStyle = .native) throws -> String
}

So it could be extended to new kinds of paths, and all systems can test how every other system behaves (besides, sometimes you're handing the URL/path to a system which isn't the one you're running on).

Oh, they are; at least as bad as URLs, if not worse. It stands to reason that when file paths and URLs meet, it's not a fun time.

compnerd · June 4, 2021, 3:41pm

FYI, the C++ example for the NT style path uses \\??\ rather than \??\ which may be different?

That's not too surprising. IIRC, IE internally relied heavily on URI and URLMoniker which would allow filtering of the kernel paths. These really shouldn't be accessible from a browser's context.

I think that it entirely depends on the use case. If this type is being used to represent the system specific paths for a system application, it does make some sense. That said, I think that in general using an NT style path is ... difficult. At that point, I don't even know if you could easily resolve everything to the final path (e.g. \GLOBAL??\PhysicalDrive0 would be a symlink to \??\Device\Harddisk0\DR0, which is inaccessible without elevated privileges).

phoneyDev · June 5, 2021, 12:35am

One thing I'll mention is string encodings for file names. I know that HFS+ uses a particular version of Unicode (Normalization Form D). When creating files the file name is normalized to that version. It's possible to create a file such that its name is similar to but not exactly the same bit pattern that you specified. I don't know if this is something important for your library. I believe that Foundation URL deals with this.

millenomi · June 5, 2021, 7:25am

In general, Foundation URLs do deal with this, and with much of what @compnerd is mentioning, e.g. regarding Windows paths, especially when converting to and from filesystem representations (that is, correctly encoded byte buffers for use with lower-level functions).

ETA: … by which I mean implementations are readily observable in Core Foundation, should one want to take a look. I missed the important bit.

Karl · June 5, 2021, 8:13pm

So, AFAICT, we shouldn't need to worry about this, and nobody who uses Swift's String type should need to worry about this, because it does Unicode-aware comparison instead of simple binary comparison. So if you create a file, the FS normalises its name to whichever form it likes, and you iterate the directory, you should still be able to find it despite possible normalisation differences thanks to String.

Really, the only developers who need to worry about it are C/C++ developers using the non-Unicode-aware strcmp/strncmp, where one of the strings is from the filesystem and the other is not. In that case, the same guidance applies that always applied to binary comparisons of Unicode strings - either normalise both to the same form (NFD, NFC, or whatever other form you like), or normalise one to match the other (the non-FS string to match the filesystem normalisation, or the FS string to match the normalisation used in your application).

CFURL's withFileSystemRepresentation (which is just a scoped version of CFString's fileSystemRepresentation) gives you a copy of the path normalised to what it thinks the filesystem APIs will give you, which is some variant of NFD. Sometimes it will get it wrong, and if you're using Swift types, comparison will actually go through the least efficient path: Swift's String optimises for NFC, not NFD.

Basically: avoid doing binary comparisons of Unicode strings, and you don't need to care about this.
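A quick way to see the difference, using only the standard library. The two strings below are the same text in precomposed and decomposed form; the variable names are just for illustration:

```swift
let precomposed = "caf\u{00E9}"   // "é" as a single scalar (NFC-style)
let decomposed = "cafe\u{0301}"   // "e" followed by a combining acute accent (NFD-style)

// String comparison uses Unicode canonical equivalence.
print(precomposed == decomposed)                          // true

// A byte-wise comparison, like strcmp would do, sees two different strings.
print(precomposed.utf8.elementsEqual(decomposed.utf8))    // false
```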

Karl · June 5, 2021, 8:20pm

My understanding is that the kernel does this for you, so you don't need to worry about normalisation when calling POSIX APIs. Also, I have indeed looked at CFURL's source code, and I don't believe it does handle NT object paths as being different from drive-relative paths.

EDIT: It doesn't handle NT object paths; in fact, it thinks all paths starting with a slash are absolute, when actually they are drive-relative:

Test code (Swift)

import Foundation

func testURL(_ path: String, comment: String? = nil) {
    if let comment = comment { print("DESC:", comment) }
    print("PATH:", path)
    let url = URL(fileURLWithPath: path).absoluteURL
    print("URL:", url)
    url.withUnsafeFileSystemRepresentation {
        guard let buffer = $0 else { print("FAILURE"); return }
        print("FSR:", String(cString: buffer))
    }
    print("----------")
}

testURL(#"."#, comment: "Relative path")
testURL(#"\foo"#, comment: "Drive-relative path")
testURL(#"\??\foo"#, comment: "NT path")
testURL(#"\D:\foo"#, comment: "Drive-relative path with drive in 1st component")
testURL(#"\D|\foo"#, comment: "Drive-relative path with drive in 1st component (alt)")
testURL(#"\??\D:\foo"#, comment: "NT path with drive")
testURL(#"\??\D|\foo"#, comment: "NT path with drive (alt)")
testURL(#"\foobar\D:\something"#, comment: "Drive-relative path with random first component, drive in 2nd component")
testURL(#"DriveName\foo"#, comment: "Volume name")
testURL(#"\DriveName\foo"#, comment: "Drive-relative path with volume name in 1st component")
getchar()

Output:

DESC: Relative path
PATH: .
URL: file:///C:/Users/User/code/Swift/
FSR: C:\Users\User\code\Swift
----------
DESC: Drive-relative path
PATH: \foo
URL: file:///foo
FSR: \foo
----------
DESC: NT path
PATH: \??\foo
URL: file:///%3F%3F/foo
FSR: \??\foo
----------
DESC: Drive-relative path with drive in 1st component
PATH: \D:\foo
URL: file:///D:/foo
FSR: D:\foo
----------
DESC: Drive-relative path with drive in 1st component (alt)
PATH: \D|\foo
URL: file:///D%7C/foo
FSR: D:\foo
----------
DESC: NT path with drive
PATH: \??\D:\foo
URL: file:///D:/foo
FSR: D:\foo
----------
DESC: NT path with drive (alt)
PATH: \??\D|\foo
URL: file:///%3F%3F/D%7C/foo
FSR: \??\D|\foo
----------
DESC: Drive-relative path with random first component, drive in 2nd component
PATH: \foobar\D:\something
URL: file:///D:/something
FSR: D:\something
----------
DESC: Volume name
PATH: DriveName\foo
URL: file:///C:/Users/User/code/Swift/DriveName/foo
FSR: C:\Users\User\code\Swift\DriveName\foo
----------
DESC: Drive-relative path with volume name in 1st component
PATH: \DriveName\foo
URL: file:///DriveName/foo
FSR: c:\foo
----------

(Note: The last 2 are because I gave my C drive the friendly label "DriveName" in Explorer. Not only is this relatively expensive, but I hope your users don't give their volumes generic names like "Files" or "Documents"!)

My favourite one is how \foobar\D:\something gets turned in to file:///D:/something. Foundation seemed to be returning something reasonable-looking for \??\D:\foo, but I couldn't find any handling for that in the code - as it turns out, Foundation simply sees a drive letter in the second component, and just indiscriminately discards the first component.

All of this is illustrative of why I think a URL library should, as much as possible, provide only an abstract model of a URL rather than getting too bogged-down automatically interpreting them. Rather than doing everything for you and being a jack of all trades, we'll learn from what I perceive to be CFURL's mistakes and delegate that to domain experts such as swift-system or the platform's native path APIs.

millenomi · June 6, 2021, 9:58pm

Fair: I hope the dual experiences can improve each other.

On that note, I’ll be filing a bug for the above today because it’s certainly not right.

Karl · June 7, 2021, 11:16am

So do I

I'd also be happy to discuss what more could be done to ensure Swift has great URL handling. RFC-1738 is from 1994, which was the year when:

Apple released its first PowerPC Macintosh, transitioning from the Motorola 68K
The Linux kernel hit version 1.0
Microsoft stopped selling MS-DOS
The W3C was founded
Netscape Navigator 1.0 was released

Undoubtedly, a lot has changed since then, and the URL standards have not held up very well (Foundation itself provides URLComponents, which actually conforms to a different RFC!). I made this library because I think a new type which matches how browsers and other systems interpret URLs today would be an asset for Swift (e.g. we now exactly match Safari, as well as platforms like Node.js and Rust). CFURL's code indicates that it dates back to at least 1998 -- it's doing very well all things considered! Especially thinking about everything it has to support (file reference URLs, security-scoped URLs, URLResourceKeys, etc). I'm sure there are things which might be done differently if it was being written today, and I'd certainly value that insight. There are not many teams who have maintained such a load-bearing URL library for such a long time.

The goal of this library is not to diminish Foundation, but to empower Swift. After file URLs, the next thing I'll be working on is interoperability with Foundation APIs like URLSession, which is likely to be even more challenging (if it's even possible with existing APIs), but I think it would add value to this library, Foundation, and Swift as a whole.

Karl · June 25, 2021, 6:22pm

OK so I'm still working on this.

It's been quite a slog, and I've rewritten and reorganised the test database several times in order to cover as many permutations as possible, as I think of new edge-cases, etc. Currently I'm at 170 tests for path -> URL conversion ...

I've noticed a catalogue of oddities with the Windows API functions. Broadly speaking, I want this function to work approximately the same as calling GetFullPathName (to normalize the path) followed by UrlCreateFromPath. The tests are organised in a JSON file, so I'm able to run them against the Windows functions and examine the differences.

I'm quite sure that some of them are bugs in the Windows API; like, it totally breaks the given path or makes a URL that won't be interpreted correctly even by libraries using legacy URL standards. Some of them are cosmetic differences, but others are a bit more subtle and it's not clear what the correct thing to do is.

Apologies in advance, because this is quite long and involves intricate path details. Still, people who use Windows or know this area: I'd really appreciate your input.

[Windows bug] The Windows API functions fail to percent-encode certain symbols.

Summary

Example path: C:\~/foo\bar?/baz#.txt\Name;with%some symbols*#
Windows API: file:///C:/~/foo/bar%3F/baz#.txt\Name;with%some symbols*#
WebURL: file:///C:/~/foo/bar%3F/baz%23.txt/Name;with%25some%20symbols*%23

Windows fails to escape the # character, % character, or even spaces (!!!). That's a serious bug in the Windows API; the first URL will actually be interpreted as having a fragment, which many applications will discard when they process URLs (which is generally fine; the fragment is typically client-side info, not server-side info).
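For illustration, here is a minimal sketch of escaping a single path component so that `#`, `%`, `?` and spaces cannot be misread. The set of escaped bytes is an assumption chosen to reproduce the example above, not necessarily the exact encode set WebURL uses, and `percentEncodeComponent` is a hypothetical name:

```swift
/// Percent-encodes one path component. Escapes control characters, space,
/// non-ASCII bytes, and the characters "%", "#" and "?".
/// The choice of escaped bytes is an assumption for this sketch.
func percentEncodeComponent(_ component: String) -> String {
  let hexDigits = Array("0123456789ABCDEF")
  let reserved: Set<UInt8> = [UInt8(ascii: "%"), UInt8(ascii: "#"), UInt8(ascii: "?")]
  var result = ""
  for byte in component.utf8 {
    if byte <= 0x20 || byte >= 0x7F || reserved.contains(byte) {
      result.append("%")
      result.append(hexDigits[Int(byte >> 4)])
      result.append(hexDigits[Int(byte & 0x0F)])
    } else {
      result.append(Character(UnicodeScalar(byte)))
    }
  }
  return result
}

print(percentEncodeComponent("Name;with%some symbols*#"))
// Name;with%25some%20symbols*%23
```

Escaping `%` itself matters for round-tripping: otherwise a literal `%some` in a file name could be mistaken for the start of an escape sequence when the URL is turned back in to a path.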

[Windows bug] UNC paths with localhost

Summary

Example path: \\localhost\SomeShare\Windows\System32\notepad.exe
Windows API: file:///SomeShare/Windows/System32/notepad.exe
WebURL: file://127.0.0.1/SomeShare/Windows/System32/notepad.exe (!!!)

Okay, this is a weird one. So, RFC-1738 and 3986 seem to imply that "localhost" is equivalent to an empty host, and the WHATWG URL Standard actually normalizes "localhost" to empty for file URLs. Unfortunately, they are not exactly equivalent when it comes to UNC file URLs - and you can see this from the Windows API's output.

The Windows API totally breaks this path. The URL file:///SomeShare/Windows/... won't round-trip to the same resource using the Windows API itself, Internet Explorer won't find it, Chrome and Edge won't know what to do with it, etc. It's just broken.

So, given that we need to preserve some kind of hostname, we have 2 reasonable options:
- DOS device path style; i.e. file://./UNC/localhost/SomeShare/.... Works on Chrome and Edge, but not IE.
- An alternative formulation of "localhost" that won't be nixed by the URL standard. I've chosen 127.0.0.1, which works on Chrome, Edge, and IE. It's not ideal, but it should be compatible with just about everything.

[Windows bug?] Share names are not trimmed, even if they have extra slashes before them.

Summary

Example path: \\my_pc\\\\share.\dir.\
Windows API: file://my_pc/share/dir/
WebURL: file://my_pc/share./dir/

So, as you can see, typically the Windows path normalization process trims a single trailing dot from the end of path components (dir. -> dir). UNC share names are an exception, so GetFullPathName normalizes the path \\my_pc\share.\dir.\ to \\my_pc\share.\dir\. The way normalization is documented, repeated slashes should be collapsed before trimming happens, but in practice it appears Windows does it the other way around - meaning in the above example, the share name gets trimmed even though it seems like it shouldn't.

Is this a bug? I don't know. I filed an issue on Microsoft's documentation site, which goes to the .NET project on GitHub, but they didn't know the answer because .NET just delegates to the OS. They suggested I ask on StackOverflow.

I'm quite sure it's a bug.

[Subtle behaviour difference] Different handling of extra leading slashes.

Summary

3 slashes:
Example path: ///foo/bAr/BaZ/qux.txt
Windows API: file:///foo/bAr/BaZ/qux.txt (host = "")
WebURL: file://foo/bAr/BaZ/qux.txt (host = "foo")

4+ slashes:
Example path: ////foo/bAr/BaZ/qux.txt
Windows API: file:///foo/bAr/BaZ/qux.txt (host = "")
WebURL: file://foo/bAr/BaZ/qux.txt (host = "foo")

So, this one is kinda weird. In my implementation, 3+ leading slashes are all collapsed and treated the same way as 2 slashes - so the first non-empty path component is the UNC hostname. Windows appears to have 2 conflicting ways of doing it:
- UrlCreateFromPath collapses 4+ slashes in to 2 slashes, so the first non-empty component is the UNC server name. However, it makes an exception for 3 slashes - which have an empty server name. So, according to UCFP:
  
  \\a\b\c -> file://a/b/c (host = "a")
  \\\a\b\c -> file:///a/b/c (host = "")
  \\\\a\b\c -> file://a/b/c (host = "a")
  \\\\\a\b\c -> file://a/b/c (host = "a")
- GetFullPathName collapses 4+ slashes in to 3 slashes; so the resulting normalized path always has an empty UNC server name. That's why the examples above all look like file URLs to local paths, because I'm testing against the combined output of GFPN followed by UCFP. According to GFPN:
  
  \\a\b\c -> \\a\b\c (host = "a")
  \\\a\b\c -> \\\a\b\c (host = "")
  \\\\a\b\c -> \\\a\b\c (host = "")
  \\\\\a\b\c -> \\\a\b\c (host = "")

IMO, UrlCreateFromPath's behaviour makes the most sense, except for its special case for 3 leading slashes, which doesn't. See, with 3 leading slashes, the path is /a/b/c, which will only round-trip back to a meaningful file path if component a is a drive letter (i.e. the path string is \\\C:\...). You end up with a really bizarre situation where the path flips between being relative, UNC, local, and back to UNC as you add slashes.

Instead, for leading slashes, we're following UrlCreateFromPath and collapsing them in to a UNC path with non-empty server name, but without the 3-slash exemption. It makes the most sense to me, but it could have security implications if Windows says \\\\\\foo\bar has an empty UNC host, but the file URL ends up with host foo.
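
To make the rule concrete, here is a minimal sketch of the "collapse 2+ leading slashes, first non-empty component is the server name" behaviour described above. The function name `uncHost(ofWindowsPath:)` is hypothetical and is not WebURL's actual implementation; it only looks at the leading separators and ignores everything else about the path.

```swift
// Hypothetical helper, for illustration only.
// Treats any run of 2 or more leading separators like exactly 2, so the
// first non-empty component is taken as the UNC server name.
func uncHost(ofWindowsPath path: String) -> String? {
  func isSeparator(_ c: Character) -> Bool { c == "\\" || c == "/" }

  let leading = path.prefix(while: { isSeparator($0) })
  // Fewer than 2 leading separators: not a UNC-style path.
  guard leading.count >= 2 else { return nil }

  let rest = path.dropFirst(leading.count)
  let host = rest.prefix(while: { !isSeparator($0) })
  return host.isEmpty ? nil : String(host)
}
```

With this rule there is no 3-slash exception: `\\a\b\c`, `\\\a\b\c` and `\\\\\a\b\c` all report `a` as the host, which is exactly where the result can disagree with what the Windows APIs report.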
[Cosmetic] Windows path normalization trims more characters than it is documented to do.

Summary

Example path: C:\foo \<U+0007><U+0009>bar<U+000A>
Windows API: file:///C:/foo%20/%07bar
WebURL: file:///C:/foo%20/%07%09bar%0A

Windows removes tabs and newlines from inside path components. This isn't part of the documented trimming procedure, and it seems to be a bit inconsistent about where it does it. Needs further investigation.
[Cosmetic] Windows sometimes removes the trailing slash from directory paths.

Summary

Example path: C:/foo/bar/.
Windows API: file:///C:/foo/bar
WebURL: file:///C:/foo/bar/

Again, this seems to be inconsistently applied. I think it's a cosmetic difference, so I'm not worried about it.
[Cosmetic] Windows drive letters are escaped in UNC paths.

Summary

Example path: \\.\C:\Windows\System32\notepad.exe
Windows API: file://./C:/Windows/System32/notepad.exe
WebURL: file://./C%3A/Windows/System32/notepad.exe

The WhatWG URL Standard interprets Windows drive letters in paths (e.g. if you parse a relative URL with the above as its base, it will have some special handling for the Windows drive). According to Microsoft's documentation, the root of a DOS device path like this is the \\.\ component, not the drive. Indeed, calling GetFullPathName on \\.\C:\foo\bar\..\..\..\..\.. returns \\.\, so it seems the correct thing to do is to escape the Windows drive so the URL standard won't try to be 'helpful'.
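
As a rough illustration of that reasoning, the sketch below resolves `.` and `..` lexically without ever popping past the `\\.\` root, and then escapes the colon of the drive component when building the URL string. Both function names are hypothetical, the model of GetFullPathName is only an approximation of the behaviour observed above, and only `:` is escaped here (a real implementation would have to percent-encode many other characters).

```swift
// Hypothetical sketch, for illustration only.
// Splits a DOS device path into the components after the `\\.\` root,
// resolving "." and ".." without ever popping past the root.
func devicePathComponents(_ path: String) -> [Substring]? {
  let root = "\\\\.\\"   // the 4 characters \\.\
  guard path.hasPrefix(root) else { return nil }
  var components: [Substring] = []
  for component in path.dropFirst(root.count).split(separator: "\\") {
    switch component {
    case ".": continue
    case "..": _ = components.popLast()   // no-op once we are at the root
    default: components.append(component)
    }
  }
  return components
}

// Rebuilds the normalized path string.
func normalizeDevicePath(_ path: String) -> String? {
  guard let components = devicePathComponents(path) else { return nil }
  return "\\\\.\\" + components.joined(separator: "\\")
}

// Builds a file URL string with host ".", escaping ":" as "%3A" so the
// drive is not recognised as a Windows drive letter by URL processing.
func deviceFileURLString(_ path: String) -> String? {
  guard let components = devicePathComponents(path) else { return nil }
  let escaped = components.map {
    $0.split(separator: ":", omittingEmptySubsequences: false)
      .joined(separator: "%3A")
  }
  return "file://./" + escaped.joined(separator: "/")
}
```

Under these assumptions, `\\.\C:\foo\bar\..\..\..\..\..` normalizes to `\\.\`, and the notepad example produces `file://./C%3A/Windows/System32/notepad.exe`.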

Issue 4 is the one I'm most concerned about. It has the potential to be a security vulnerability if the Windows API tells you a path has a UNC server name of "", but round-tripping through a file URL changes that. That being said, the Windows API functions don't seem to completely agree on how to handle extra leading slashes, either.

Some of these issues (e.g. 2 and 7) might be better-solved by making changes to the URL standard, to preserve localhost as a hostname and disable some special behaviours if the file URL has a host. I'm going to be asking for input from the WHATWG members, as well.

So yeah, apologies again because this does involve a lot of intricate, technical details. Still, I'd appreciate any advice: is there anything here that looks obviously wrong or unreasonable?

Michael_Ilseman · June 29, 2021, 2:16pm

This would be a great test case for FilePath's Windows semantics.

Would such a file URL be restricted to a platform "family" (meaning POSIX-ish or Windows)?

FilePath.lexicallyResolving is intended to help for this use case, so that the resolved subpath is guaranteed to be lexically contained in the outer. (symlinks may still cause escaping, but that's somewhat inherent to file systems and it's up to the server to have sane directory practices).
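
For readers without swift-system to hand, the idea of lexical containment can be sketched in plain Swift. This is a hypothetical stand-in that only handles `/`-separated paths; it is not FilePath's implementation, and like any lexical check it knows nothing about symlinks.

```swift
// Hypothetical stand-in for the idea behind lexical resolution.
// Resolves `subpath` against `base`, ignoring any leading "/" in `subpath`,
// and returns nil if ".." components would climb out of `base`.
func lexicallyResolve(_ subpath: String, within base: String) -> String? {
  var components: [Substring] = []
  for component in subpath.split(separator: "/") {
    switch component {
    case ".": continue
    case "..":
      // Popping with nothing left would escape `base`, so reject the subpath.
      if components.popLast() == nil { return nil }
    default: components.append(component)
    }
  }
  let trimmedBase = base.hasSuffix("/") ? String(base.dropLast()) : base
  return ([trimmedBase] + components.map { String($0) }).joined(separator: "/")
}
```

A server that decodes a path from a URL could pass it through a check like this against its document root, so that `img/../index.html` resolves inside the root while `../../etc/passwd` is rejected outright.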

FilePath.isRelative returns true for all 3 of those cases. We've been wanting to add further root analysis APIs for Windows (but it's hard to prioritize against multi-platform functionality), so if you come up with anything in this space it could be interesting to incorporate into System.

This is an interesting operation. System provides, at its core, an OS interpretation of file paths with common functionality built on top of it (e.g. viewing components, lexical normalization and subpath nesting). Domain-specific concerns can be layered on top by the client library (e.g. SPM has a AbsolutePath type that they like to traffic primarily in, which is a wrapper around FilePath).

But, if a domain concern is common such that multiple clients can benefit from a shared implementation, semantics, and API, then it makes sense to consider incorporating into System. Is the problem of interpreting percent-encoded string content common outside of URL processing? E.g. could another library be concerned about those being present in a file path? Or, present in other system strings such as user/group names?

Michael_Ilseman · June 29, 2021, 3:20pm

We could consider combining efforts on making a better test suite for both System and WebURL. Here's a self-generating C# test I used when designing FilePath's Windows support: C# path examples · GitHub. It has some interesting corner cases involving the treatment of legacy DOS device paths.

Karl · June 29, 2021, 4:12pm

In practice, they would. It turns out that there are almost no file paths which are both absolute/fully-qualified and valid on both POSIX and Windows systems:

C:\Windows\ - absolute path on Windows, relative path on POSIX
/usr/bin/ - absolute path on POSIX, drive-relative path on Windows

UNC paths which happen to start with forward slashes (//mypc/share/dir) are the exception: they are always fully-qualified (never relative to your current working directory/drive), and POSIX says that 2 leading slashes are "implementation defined" (and some OSes/applications do indeed give them special meaning).

However, for POSIX paths, we wouldn't turn UNC paths in to file URLs with hostnames (apparently Chromium used to, but they stopped because they consider it a security risk); the extra slash would be kept as part of the path, e.g. file:////mypc/share/dir. Interestingly, those kinds of URLs wouldn't convert back in to file paths on Windows, because they don't begin with a drive letter.

Some of this is a consequence of me wanting this feature to be very strict; it's stricter than any other implementation I've seen. But I've been looking at examples of traversal attacks in the wild via reports on cvedetails.com, and I'm convinced that hiding an os.path.join or FilePath.push within this function isn't wise.

Yes, that's perfect for this!

You typically know a safe subtree for the application to access, and any symlinks that exist are ones that you intentionally put there (it's generally not so easy for an attacker to create symlinks to arbitrary locations... I think?), so this is just the right thing.

Users may still need to think about how they deal with relative paths, if they choose to use them. That said, most OSes seem to strongly discourage using relative paths at all (for thread safety). Microsoft outright says they aren't supported in multithreaded applications or shared libraries:

The current directory state written by the SetCurrentDirectory function is stored as a global variable in each process, therefore multithreaded applications cannot reliably use this value without possible data corruption from other threads that may also be reading or setting this value. This limitation also applies to the SetCurrentDirectory and GetFullPathName functions.
...
Using relative path names in multithreaded applications or shared library code can yield unpredictable results and is not supported.

AFAIK, it's only relevant to URLs. That point was about the URL -> path conversion; in that case, we need to account for users manipulating the string at the URL level, and injecting stuff which may be decoded in a surprising way.

I'll definitely take a look at those, thanks! It could certainly make sense to have a common database of "weird paths", even if the results (a file URL, or the extracted path components) are library-specific.

I only get a couple of days per week to work on this, but I'm hoping to have it ready soon, then we can consider combining those test cases.

Karl · July 8, 2021, 1:10pm

I'm trying to figure out the encoding story for this, and... it's just awful. It isn't a functionality problem, because at the end of the day URLs can just percent-encode arbitrary bytes and accurately preserve the path, whatever its encoding, but there's a serious usability problem.

My first draft of the API adds the following functions:

// Path2URL
extension WebURL {

  public init<S: StringProtocol>(
    filePath: S, style: FilePathStyle = .native
  ) throws

  public static func fromFilePathBytes<Bytes: Collection>(
    _ path: Bytes, style: FilePathStyle = .native
  ) throws -> WebURL where Bytes.Element == UInt8
}

// URL2Path
extension WebURL {

  public func filePath(style: FilePathStyle = .native) throws -> String

  public static func filePathBytes(
    from url: WebURL, style: FilePathStyle = .native
  ) throws -> ContiguousArray<UInt8>
}

But after trying to write documentation for these functions, I'm reluctantly coming around to the idea that the String versions just aren't going to work; there'd be a bunch of intricate caveats telling developers not to use it in this case, or that case, etc. - most of which are difficult, if not impossible, for developers to predict - and that they should traffic in terms of arrays instead.

I mean, this is a snippet from my current attempt to document WebURL.fromFilePathBytes. I can't imagine an average developer reading this and understanding what they're supposed to do. I don't think it's because it is poorly-worded (or at least, that isn't the only reason); there is just inherent complexity that is difficult to smooth over:

/// ## Encoding
///
/// This function accepts its path as a `Collection` of bytes, which allows certain paths to be expressed more precisely than `String` allows.
///
/// ### POSIX
///
/// POSIX-style paths are typically considered semi-arbitrary byte sequences; path components are delimited by the ASCII forward-slash (`0x2F`),
/// a component consisting of one or two ASCII periods (`0x2E`) is interpreted as a reference to the current or parent directory, respectively,
/// and the ASCII null byte (`0x00`) is often considered to be the end of the byte sequence - but otherwise, file and directory names are just opaque bytes.
/// In practice, file and directory names are often UTF-8, but they may not be, and so creating a `String` of a filesystem path may corrupt it by replacing certain
/// bytes with replacement characters (`�`).
///
/// Besides the reserved bytes listed above, this function does not assume that file or directory names have any particular encoding or interpretation.
/// Any bytes which would be interpreted by URL semantics are preserved by percent-encoding, so they may be decoded to their original values.
/// Note that the same considerations apply when converting the file URL back to a path - if the encoded bytes could not be losslessly
/// represented as a Swift `String` _before_ conversion to a URL, that will still be the case when performing the reverse transformation.
/// Use `WebURL.filePathBytes(from:style:)` to obtain the precise bytes of the path without Unicode replacements performed by `String`.
///
/// Note that on macOS and other Darwin platforms (iOS, iPadOS, tvOS, etc.), as well as ChromeOS, paths are guaranteed to be valid UTF-8.
/// The exact sequence of bytes may vary as the operating system performs Unicode normalization on file and directory names, meaning the percent-encoded
/// bytes in their URL representations may also differ, but their paths can always be losslessly represented by Swift's `String` type,
/// and `String`'s Unicode-aware comparison will ensure that these paths compare as equal to each other.
///
/// ### Windows
///
/// Windows paths (since Windows NT) are natively UTF-16-LE and are _not_ opaque byte sequences. The platform APIs expose them both as sequences
/// of 16-bit code-units (via the `-W` APIs) and, for legacy reasons, as sequences of bytes transcoded to the system's active code page (via the `-A` APIs).
/// These latter APIs are fundamentally lossy, as the active code page typically cannot represent every Unicode character, so users should take care
/// to use the `-W` APIs when interfacing with the Windows filesystem, unless the active code page is known to be a Unicode encoding such as UTF-8.
/// Well-formed UTF-16 can be converted to UTF-8 and losslessly round-tripped back to the same sequence of UTF-16 code-units, so this is the recommended
/// way to create a file URL from a Windows path.
///
/// However, it would be _far too easy_ if everything was well-formed UTF-16; so to keep things interesting, Windows also allows ill-formed UTF-16,
/// such as unpaired surrogate code-points, in file and directory names. These simply cannot be expressed in UTF-8, and so, unfortunate
/// as it may be, the expression of these code-points in an 8-bit encoding is left as an exercise for the reader. [WTF-8][WTF-8] may be a good choice,
/// but since it is a relaxed/intentionally broken version of UTF-8, users are discouraged from passing such URLs around to other applications,
/// which may not be able to decode them correctly.
///
/// As with POSIX paths, this function only interprets a small number of ASCII byte values - the forward- and back-slashes (`0x2F` and `0x5C`),
/// period (`0x2E`), space (`0x20`), and colon (`0x3A`) - and percent-encodes any other bytes which may be interpreted by URL semantics.
/// For UNC paths, the server name must be valid UTF-8 as it may be subject to IDNA normalization, which requires valid Unicode text.
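
The lossiness that this documentation warns about is easy to reproduce with the standard library alone. The snippet below is only a demonstration of `String`'s repair behaviour; the byte values are made-up examples, not taken from a real filesystem.

```swift
// A POSIX file name that is not valid UTF-8 (0xFF never appears in UTF-8).
let posixName: [UInt8] = [0x66, 0x6F, 0x6F, 0xFF]   // "foo" followed by 0xFF
let posixString = String(decoding: posixName, as: UTF8.self)
// The invalid byte was replaced by U+FFFD, so the original bytes are gone.
let posixRoundTrip = Array(posixString.utf8)

// A Windows file name containing an unpaired surrogate (ill-formed UTF-16).
let windowsName: [UInt16] = [0x0061, 0xD800]        // "a" followed by a lone surrogate
let windowsString = String(decoding: windowsName, as: UTF16.self)
// Again the offending code unit was replaced by U+FFFD.
let windowsRoundTrip = Array(windowsString.utf16)

print(posixRoundTrip == posixName)       // false
print(windowsRoundTrip == windowsName)   // false
```

In both cases the conversion succeeds silently, which is what makes a `String`-based API hard to document honestly: nothing tells the caller that the path they got back is no longer the path on disk.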

(Also, lots of documentation that I've found seems to suggest that the only reserved bytes for POSIX paths are / and NULL. I don't think that's strictly true - I got quite worried about ASCII periods and went digging through the Linux kernel, and it turns out that Linux also interprets ASCII . and .. path components.)

If I cut the String initializer and String-returning function, I'm left with functions which traffic in arrays or collections of bytes, which is just a really poor API. It makes real-world usage incredibly awkward - to give one basic example, if developers want to pass the returned array to an imported C filesystem API, it would need to be null-terminated, but if they wanted to construct a String, they'd need to strip the trailing null or remember to use the String(cString:) initializer.
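
A small illustration of that awkwardness, assuming the function returned a null-terminated array (the `returnedBytes` value here is just a stand-in for such a result):

```swift
// Stand-in for a null-terminated path returned as raw bytes.
let returnedBytes: [UInt8] = Array("/tmp/a.txt".utf8) + [0]

// Decoding the whole array keeps the terminator as a U+0000 character.
let withTerminator = String(decoding: returnedBytes, as: UTF8.self)

// Either strip the terminator first...
let stripped = String(decoding: returnedBytes.dropLast(), as: UTF8.self)

// ...or treat the buffer as a C string, which stops at the first null byte.
let viaCString = returnedBytes.withUnsafeBufferPointer {
  String(cString: $0.baseAddress!)
}

print(withTerminator == "/tmp/a.txt")   // false
print(stripped == viaCString)           // true
```

None of these choices is visible in the type `[UInt8]`, so every caller has to know the convention.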

Which leads me to think that the best way forward is to wrap the array, to capture that it is a "sort-of-string". This brings me to swift-system, which fortunately already contains just such a type: SystemString. Imagine if the APIs described above returned a SystemString rather than a String or array of bytes; immediately they would be so much better. Not only does it have a descriptive name, it could be a currency type which could be used in filesystem operations directly, if we could share this with swift-system.

Unfortunately, even if SystemString was a public API, adding a dependency on swift-system would be too high a burden: SwiftPM does not support optional dependencies/cross-import overlays, so this would have to be an unconditional dependency, even for users who don't care about file paths. Even if I could add it as an optional dependency, the roadmap for swift-system includes all sorts of platform APIs (e.g. processes and signals, sockets, pthreads, even ttys), and I can't limit the dependency to this small piece of it.

Maybe it's completely out of the question, but - would it be possible to duplicate/move SystemString to a lower-level library that could be shared?

And if I could maybe push it a little bit further - what about splitting out FilePath? IIUC, it is purely lexical, so it doesn't require any actual platform APIs (although they could be added in swift-system). That would allow this library (and others) to return an even higher-level type, and enable software which can accurately create paths for remote systems.

@Michael_Ilseman what do you think?

Karl · July 28, 2021, 11:05am

PR is up: File paths by karwa · Pull Request #53 · karwa/swift-url · GitHub

It has taken a bit longer than planned, but I wanted to be sure that I'd thoroughly researched how other applications/libraries approach this to ensure compatibility. I spent a fair bit of time digging through Chromium, WebKit, rust-url (summary of findings from those 3), Foundation, and a bit of time looking at Firefox.

This implementation is different from any of those, but the results should be compatible - it is closest to rust-url, but is much stricter about which paths it accepts/produces, and more defensive about escaping characters which the URL standard might otherwise interpret (e.g. the POSIX path /C:/ becomes file:///C%3A/, because some parts of the standard have quirks built-in for Windows drives. That's fine, but not really desirable when we're starting from a path which has no concept of drive letters). On the other hand, we do some limited path canonicalisation for Windows paths, which is not strictly necessary but is convenient and I think it's nice to have (e.g. Windows drops single trailing periods - C:\Windows\ and C:\Windows.\ point to the same place. With WebURL, they will also produce the same URL).
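
The defensive escaping for POSIX paths can be sketched as follows. This is a hypothetical illustration, not the code in the PR: the function name and the exact set of bytes left unescaped are assumptions, and it performs none of the stricter validation described above.

```swift
// Hypothetical sketch, for illustration only.
// Percent-encodes every byte of an absolute POSIX path except "/" and a
// conservative set of ASCII characters, so ":" becomes "%3A".
func fileURLString(fromPOSIXPath path: [UInt8]) -> String? {
  // Only absolute paths without embedded null bytes are accepted.
  guard path.first == UInt8(ascii: "/"), !path.contains(0) else { return nil }

  let unescaped = Set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~/".utf8
  )
  let hex = Array("0123456789ABCDEF".utf8)

  var result = Array("file://".utf8)
  for byte in path {
    if unescaped.contains(byte) {
      result.append(byte)
    } else {
      result += [UInt8(ascii: "%"), hex[Int(byte >> 4)], hex[Int(byte & 0x0F)]]
    }
  }
  return String(decoding: result, as: UTF8.self)
}
```

Under these assumptions `/C:/` yields `file:///C%3A/`, and a path with two leading slashes such as `//mypc/share/dir` keeps its extra slash in the URL path (`file:////mypc/share/dir`) rather than gaining a hostname. Because the input is bytes rather than a `String`, a non-UTF-8 name survives as percent-encoded bytes.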

I believe we also have the most extensive test suite of any implementation. I'm hoping to use this as a basis for future standardisation work.

--

The PR contains the lowest-level, platform-neutral functions, which operate on a path as a mostly-opaque collection of bytes. They should accurately preserve the contents of paths in any filesystem-safe encoding with 8-bit code-units (ASCII, UTF-8, WTF-8, Latin-1, and any other EUC or ASCII-compatible encoding).

The only issue I'm still struggling with is the API, and specifically the types used to represent file paths.

FilePath is definitely the thing I want to use. However, swift-system only provides a small selection of very low-level interfaces (which is all it is designed for, to be fair). It is not a substitute for Foundation's friendlier filesystem APIs such as FileManager, and none of those APIs support FilePath. Holistically, the benefits of FilePath end up being small/non-existent because users will have to drop to String or C-strings to do much with them.
String requires Unicode text, so it won't accurately represent some paths on legacy systems. OTOH, it will almost always be fine on modern systems, is well supported by Foundation APIs, and doesn't require an additional package dependency. It's really tempting.

Swift just lacks a filesystem API that combines the correctness of FilePath with the convenience of FileManager. You have to choose between correctness or convenience.

Currently I'm leaning towards FilePath, hoping that better filesystem APIs can be developed later, and using FilePath's built-in ability to degrade to a String to support Foundation (I say 'degrade' because it may corrupt the path, not to diss String). Adding lower-level variants of Foundation's APIs which support FilePath will be important follow-up work, and I hope the Foundation team agree to include them.

The additional package dependency will require some module juggling. I'm considering reorganising them so that import WebURL gives you everything, but you can opt to have only the parser with no dependencies using import WebURLCore.

example	relative to
foo\bar	current directory, current volume
\foo\bar	current volume
D:foo\bar	current directory on given volume