Next steps: File URLs

Hi

The next feature I'm working on for WebURL is file URL <-> path conversion. This thread is to let you know what's happening, what the challenges are, and to discuss if there are any issues with the approach. I would be particularly grateful for any feedback on how I intend to handle Windows paths, as it's the trickiest platform and the one I have the least experience with.

The goal is that you can create a file URL from a local path, give it to an application which supports both local and remote resources, and that application will be able to reverse the transformation to end up at exactly the same local path.

The most challenging aspect of this is Windows support, as it has 3 different kinds of paths (DOS paths, UNC paths, and DOS device paths), each of which has sub-types with their own sets of challenges. I'm basing the implementation on Chromium's utilities, because there is a general consensus among participants in the URL Standard to align with Chrome when it comes to handling files on Windows. However, Chrome's implementation is not extremely well tested, so some aspects of its behaviour are not very well defined, and other aspects make it impractical for direct exposure to developers as API (e.g. you can create file URLs with hostnames on any platform, but those URLs will fail to round-trip back to paths on any platform but Windows). So some adjustments will have to be made and test coverage will have to be expanded.

I'd like for WebURL to incorporate modern best practices for security for reliability, and that means some kinds of paths won't be supported, sometimes depending on the platform:

  • Relative paths will not be supported.

    URLs cannot express relative paths - for some URL file:///foo, there is no way to say that foo should be resolved relative to the working directory of application processing the URL. This means that when creating a file URL from a path, we would need to resolve it in to an absolute path.

    Automatically resolving such paths has the potential to leak internal server/application details - for instance, if some part of the path contains data derived from user input, it may be possible to escape the intended filesystem subtree so that the resolved path points to an internal configuration file. This is known as a directory traversal attack.

    In order to help mitigate these kinds of errors, it will not be possible to create a file URL from a relative path. No automatic resolution will be performed. Similarly, when creating a path from a file URL, if the result would be interpreted by the operating system as a relative path, the function will fail and throw an error instead.

    If developers need to create a file URL from a relative path, they can use libraries such as swift-system or utilities provided by the operating system, such as realpath, to resolve the path relative to whichever base directory is appropriate. Those libraries can also provide utilities to check that the resulting path does not escape the intended subtree. I feel that there is value in keeping this as an explicit, separate step for applications which deal in relative paths.

    On POSIX, this is relatively straightforward. For path -> URL conversion, we reject paths which do not begin with a "/". For the reverse transformation (URL -> path), we don't need to do anything, since file URLs are assumed to have absolute paths and always start with a "/".

    On Windows, things are more complicated. Fortunately, of the 3 kinds of paths, only DOS-style paths can be relative. Unfortunately, DOS-style paths themselves support 3 kinds of 'relativeness':

    example relative to
    foo\bar current directory, current volume
    \foo\bar current volume
    D:foo\bar current directory on given volume

    Note that even though the last example includes a volume, the path after it does not begin with a '/', so it is considered relative to the current directory of the D: volume (every volume can have a different "current directory" on Windows). Leaving out that slash is a common source of bugs.

    I intend that creating a file URL from any of these 3 paths, or creating any of these paths from a file URL, would fail on Windows. This means that, for instance, the URL file:///foo/bar can be turned in to a path on POSIX systems, but not on Windows systems.

  • Creating a POSIX path from a file URL with a hostname will not be supported.

    There is no obvious interpretation of a file host on POSIX systems, where remote filesystems are typically mapped to the local filesystem. We could just ignore the host, but RFC-8089 discourages this and Chrome stopped doing it because it could lead to spoofing issues (e.g. file://accounts.google.com/some/local/file and file:///some/local/file would display the same content). File hosts will be supported on Windows, where they will result in UNC paths.

  • Creating a path from a URL containing percent-encoded directory separators will not be supported.

    For instance, file:///some/folder/..%2F..%2F..%2F..%2F../etc/passwd is a valid URL. However, when we create a file path from a URL, we need to decode all percent-encoding (percent-encoding has no meaning in FS paths). This would result in the path /some/folder/../../../../../etc/passwd which again, can leave users open to directory traversal attacks. We're following Chrome here, which determined that this was a security risk.

    Note that percent-encoded backslashes (%5C) are allowed on POSIX systems, as the OS does not consider them to be path separators. Windows supports both, so we need to block both on Windows.

It's worth noting that this behaviour is more restrictive than Foundation's URL type, which automatically resolves relative paths using the current process' working directory and allows percent-encoded directory separators. For the reasons given above, I consider both of these to be security risks.

Examples of Foundation doing things I think it probably shouldn't be doing
// Example: Foundation automatically resolves relative paths.
import Foundation

// CWD = /home/myservice/
let templateName = "../../../etc/passwd"
URL(fileURLWithPath: "templates/\(templateName)").absoluteURL // file:///etc/passwd
// Example: Foundation decodes pct-encoded separators
import Foundation

let url = URL(string: "file:///some/folder/..%2F..%2F..%2F..%2F../etc/passwd")!
print(url) // file:///some/folder/..%2F..%2F..%2F..%2F../etc/passwd
url.withUnsafeFileSystemRepresentation {
  print(String(cString: $0!)) // /some/folder/../../../../../etc/passwd
}

(Note that in the latter case, .standardizedFileURL will decode and simplify the URL to file:///etc/passwd before it gets to the OS, but the regular .standardized will not. So there is technically a way around it, but you need to know precisely what you need and go looking for it).

My hope is that this strikes a pragmatic balance, allowing as many file paths as we can reasonably and reliably support, while learning from some of the security risks that browsers have encountered and improving on the APIs that Swift developers currently have at their disposal. Each of these failure conditions will throw descriptive errors, so if they occur in your application, you'll have some guidance on what went wrong and how you can fix it.

Currently, implementation is blocked by some changes I'd like to make to the URL Standard in order to better support UNC hostnames, but I may release a POSIX-only update if it takes too long to make progress on that issue.

So that's it. What do you think - are these safeguards too restrictive? Have you ever been affected by a directory traversal attack? Or a bug stemming from Foundation's automatic relative path resolution? Do you know of any other edge-cases to consider? It's all useful signal, so I'd appreciate the feedback.

Thanks for reading!

:v::duck:

6 Likes

Some interesting links for the curious:

  • MSDN via archive.org: The bizarre and unhappy story of file URLs

    No one had been quite as abused as the the little file: URL. This URL was special because we had always used files and DOS paths (and no one at the time knew about path canonicalization attacks), everyone was quite sure what they looked like , acted like, and even tasted like. It didn't help that the file: protocol remained in RFC limbo as a platform/OS specific protocol. So the browser and the browser's little friends would take turns dressing a DOS path like an URL in a pink bunny suit and undressing the URL with a pair of rusty scissors, pretending it was the same DOS path they started with. Only the simplest of URLs was able to withstand this abuse, and it soon became clear that something would have to be done, lest the little file: URLs go off on their own and be lost forever.

  • Microsoft: CreateURLMoniker considered harmful

    CreateURLMoniker does a couple of horrible things to file URIs that if misused can lead to security bugs.

  • Microsoft: File URIs in Windows

    Invalid file URIs are among the most common illegal URIs that we were forced to accommodate in IE7. [...] The standard for the file scheme doesnโ€™t give specific instructions on how to convert a file system path for a specific operating system into a file URI. While the standard defines the syntax of the file scheme, it leaves the conversion from file system path to file URI up to the implementers.

    Since thereโ€™s no way to include an IPv6 address in a Windows file path, thereโ€™s no corresponding file URI and so thereโ€™s no way to incorporate an IPv6 address in file URIs in Windows.

  • Wikipedia: File URI Scheme

  • RFC-8089: The "file" URI Scheme (really not as helpful as you'd hope it would be)

[EDIT: And another one. cvedetails.com actually has a category for directory traversal attacks, they are that common]

  • CVEDetails: Directory Traversal Attacks

    Some of these are great (terrible, but great)

    IBM Host firmware for LC-class Systems could allow a remote attacker to traverse directories on the system. An attacker could send a specially-crafted URL request that would allow them to delete arbitrary files on the system.

    In Django 2.2 before 2.2.21, 3.1 before 3.1.9, and 3.2 before 3.2.1, MultiPartParser, UploadedFile, and FieldFile allowed directory traversal via uploaded files with suitably crafted file names.

    A path handling issue was addressed with improved validation. This issue is fixed in macOS Big Sur 11.0.1. A remote attacker may be able to modify the file system. [-- yes, it's macOS! I can't find many details, but it's CVE-2020-27896]

2 Likes

Unless you are categorizing things differently, this isn't entirely correct. There are additional path types :sweat_smile:

DOS Path: C:\Users\compnerd\Desktop
UNC Path: \\hostname\C$\Users\compnerd\Desktop
DOS Device Path: \\?\Volume{GUID}\Users\compnerd\Desktop
NT Object Path: \??\C:\Users\compnerd\Desktop, \??\UNC\hostname\C$\Users\compnerd\Desktop

Of course you have the various forms in each, but there is at least a fourth category of paths that you are overlooking. Those all do reference the same location assuming the hostname and GUID are properly replaced.

Out of curiosity, do you intend to support HFS paths? (Not HFS+, HFS paths of the form Macintosh HD:Users:compnerd:Desktop).

File paths are miserable.

5 Likes

Hmm, so I've been going by the documentation I linked to, which doesn't mention NT object paths. Thanks for letting me know about them!

Given that NT paths begin with a \, without any special handling they would be rejected for looking like a relative path. After spending a bit of time reading up on them and doing some experiments in a Windows VM, I think that is probably the correct behaviour (although I'll definitely be adding explicit tests for this style of path), for a couple of reasons:

  1. Windows itself doesn't always handle these paths correctly.

    As detailed by the Project Zero blog, calling GetFullPathName with an NT object path will treat it like a "rooted" (i.e. drive-relative) path. I tested this in the VM, to see if anything had changed since 2016 when that post was written, but it appears that it hasn't:

    Example (C++)
    #include <iostream>
    #include <windows.h>
    #pragma comment(lib, "Kernel32.lib")
    
    void testPath(LPCSTR path) {
        char out[MAX_PATH + 1];
        DWORD len = GetFullPathNameA(path, MAX_PATH, out, NULL);
        out[len] = '\0';
        std::cout << "PATH:      " << path << std::endl;
        std::cout << "FULL PATH: " << out << std::endl;
    }
    
    int main() {
        testPath("\\??\\D:\\foo");
    }
    

    Output:

    PATH:      \??\D:\foo
    FULL PATH: c:\??\D:\foo
    

    We don't have the source code for Windows, of course, but we can take a peek at the source code for .Net, which uses similarly basic logic (i.e. considering all paths which start with a leading slash to be "rooted", drive-relative paths). There is also some code which suggests that \??\ is equivalent to \\?\. But quite often, I see Microsoft's own APIs treating these like relative paths, so it seems quite reasonable to do the same and ultimately reject them, as we do for other kinds of relative paths. I'm open to suggestions, though, if you think this is important.

  2. Neither Edge nor Internet Explorer support NT object paths in file URLs

    The Windows API UrlCreateFromPathA encodes the ?? component directly in the path, as you might expect: \??\X:\ABC -> file:///%3F%3F/X:/ABC, and PathCreateFromUrlA can successfully recreate the path from such a URL. That said, I tried in both Edge and Internet Explorer, and neither would open this URL. So there don't seem to be any legacy/compatibility concerns, AFAICT.

    FWIW, I also tried encoding the ?? component as a hostname, but that seemed to be even worse: PathCreateFromUrlA turned the URL file://%3F%3F/C:/Windows in to the clearly wrong UNC path \\??C:\Windows. Again, neither Edge nor Internet Explorer knew what to do with it.

  3. It's not clear that it even makes sense to make file: URLs from these paths

    These paths represent very low-level objects, used in very specific contexts and only supported by specific APIs. I'm not sure it really makes sense to turn them in to URLs in the first place (or at least not file: URLs - perhaps a custom scheme?). It may actually be a good thing for that operation to fail; to serve as a signal to developers that they are probably not creating the URL they expected, and that they should use the platform APIs first to resolve them to a DOS/UNC-style path.


This raises an interesting point about the value in limiting the set of inputs which your APIs support. Even if we don't support every single kind of path that could possibly exist on Windows, I don't think that should necessarily be considered a flaw. It makes a good deal of sense to limit the inputs to paths which we (or I) can reasonably understand, test, and maintain, and to do our absolute best to reject every other kind of input. We'd get better guarantees if paths were strongly typed, but being strings, the only option is to have very specific predicates for what is allowed and to reject anything we don't recognise.

One of the reasons URLs are in such a sorry state is that implementors were often too lenient (or just didn't care) about supporting more than was absolutely required. This means that we now have decades worth of obscure legacy features which make URLs harder and slower to parse or reason about, and which hardly anybody actually uses, but which can't actually be removed because of compatibility concerns.

One example: IPv4 address. RFC-1738 (the 1994 standard which Foundation's URL conforms to) described a URL's "host" as being:

The fully qualified domain name of a network host, or its IP address as a set of four decimal digit groups separated by "." [emphasis added]

Lots of implementations weren't very strict about that, and deferred to libc's inet_aton to parse/detect IPv4 addresses -- it does, after all, support the "dotted decimal" notation, but it also supports basically every other way you could think of to write a 32-bit integer. 0xbadf00d is a valid IPv4 address according to inet_aton. When the URL specification was updated by RFC-3986 16 years ago, this was specifically called out as not being within spec and a potential security concern. But as things stand, we're stuck with it for compatibility.

Every time we parse a URL, we need to check its hostname to see if it's an IPv4 address (v6 addresses are much simpler, since they are enclosed in [ ] brackets), so everybody is paying for this parser which is more complex than anybody ever intended it to be. FWIW, the WHATWG URL Standard rewrites these IP addresses, so if you parse http://0xbadf00d/ with WebURL, you'll get http://11.173.240.13/ back.

1 Like

I don't have any plans to add that support (AFAICT, even Foundation has deprecated support for them, and the only way to do it now is via CFURL using knowledge of enum raw values). It wouldn't be difficult to support if anybody was interested.

The API I have planned looks (very roughly) like this:

enum FilePathStyle {
  case windows
  case posix

  static var native: FilePathStyle {
    #if os(Windows)
      return .windows
    #else
      return .posix
    #endif
  }
}

extension WebURL {

  public init(fileURLFromPath: String, style: FilePathStyle = .native) throws

  func toFilePath(style: FilePathStyle = .native) throws -> String
}

So it could be extended to new kinds of paths, and all systems can test how every other system behaves (besides, sometimes you're handing the URL/path to a system which isn't the one you're running on).

Oh, they are; at least as bad as URLs, if not worse. It stands to reason that when file paths and URLs meet, it's not a fun time.

  1. FYI, the C++ example for the NT style path uses \\??\ rather than \??\ which may be different?
  2. That's not too surprising. IIRC, IE internally relied heavily on URI and URLMoniker which would allow filtering of the kernel paths. These really shouldn't be accessible from a browser's context.
  3. I think that it entirely depends on the use case. If this type is being used to represent the system specific paths for a system application, it does make some sense. That said, I think that in general using an NT style path is ... difficult. At that point, I don't even know if you could easily resolve everything to the final path (e.g. \GLOBAL??\PhysicalDrive0 would be a symlink to \??\Device\Harddisk0\DR0, which is inaccessible without elevated privileges).

One thing I'll mention is string encodings for file names. I know that HFS+ uses a particular version of unicode (Normalization Form D). When creating files the file name is normalized to that version. It's possible to create a file such that its name is similar to but not exactly the same bit pattern that you specified. I don't know if this is something important for your library. I believe that Foundation URL deals with this.

In general, Foundation URLs do deal with this, and with much of what @compnerd is mentioning eg re: Windows paths, especially when converting to and from filesystem representations (that is, correctly encoded byte buffers for use with lower-level functions).

ETA: โ€ฆ by which I mean implementations are readily observable in Core Foundation, should one want to take a look. I missed the important bit.

So, AFAICT, we shouldn't need to worry about this, and nobody who uses Swift's String type should need to worry about this, because it does Unicode-aware comparison instead of simple binary comparison. So if you create a file, the FS normalises it in whichever encoding it likes, and you iterate the directory, you should still be able to find it despite possible normalisation differences thanks to String.

Really, the only developers who need to worry about it are C/C++ developers using the non-Unicode-aware strcmp/strncmp, where one of the strings is from the filesystem and the other is not. In that case, the same guidance applies that always applied to binary comparisons of Unicode strings - either normalise both to the same form (NFD, NFC, or whatever other form you like), or normalise one to match the other (the non-FS string to match the filesystem normalisation, or the FS string to match the normalisation used in your application).

CFURL's withFileSystemRepresentation (which is just a scoped version of CFString's fileSystemRepresentation) gives you a copy of the path normalised to what it thinks the filesystem APIs will give you, which is some variant of NFD. Sometimes it will get it wrong, and if you're using Swift types, comparison will actually go through the least efficient path: Swift's String optimises for NFC, not NFD.

Basically: avoid doing binary comparisons of Unicode strings, and you don't need to care about this.

My understanding is that the kernel does this for you, so you don't need to worry about normalisation when calling POSIX APIs. Also, I have indeed looked at CFURL's source code, and I don't believe it does handle NT object paths as being different from drive-relative paths.

EDIT: It doesn't handle NT object paths; in fact, it thinks all paths starting with a slash are absolute, when actually they are drive-relative:

Test code (Swift)
import Foundation

func testURL(_ path: String, comment: String? = nil) {
    if let comment = comment { print("DESC:", comment) }
    print("PATH:", path)
    let url = URL(fileURLWithPath: path).absoluteURL
    print("URL:", url)
    url.withUnsafeFileSystemRepresentation {
        guard let buffer = $0 else { print("FAILURE"); return }
        print("FSR:", String(cString: buffer))
    }
    print("----------")
}

testURL(#"."#, comment: "Relative path")
testURL(#"\foo"#, comment: "Drive-relative path")
testURL(#"\??\foo"#, comment: "NT path")
testURL(#"\D:\foo"#, comment: "Drive-relative path with drive in 1st component")
testURL(#"\D|\foo"#, comment: "Drive-relative path with drive in 1st component (alt)")
testURL(#"\??\D:\foo"#, comment: "NT path with drive")
testURL(#"\??\D|\foo"#, comment: "NT path with drive (alt)")
testURL(#"\foobar\D:\something"#, comment: "Drive-relative path with random first component, drive in 2nd component")
testURL(#"DriveName\foo"#, comment: "Volume name")
testURL(#"\DriveName\foo"#, comment: "Drive-relative path with volume name in 1st component")
getchar()

Output:

DESC: Relative path
PATH: .
URL: file:///C:/Users/User/code/Swift/
FSR: C:\Users\User\code\Swift
----------
DESC: Drive-relative path
PATH: \foo
URL: file:///foo
FSR: \foo
----------
DESC: NT path
PATH: \??\foo
URL: file:///%3F%3F/foo
FSR: \??\foo
----------
DESC: Drive-relative path with drive in 1st component
PATH: \D:\foo
URL: file:///D:/foo
FSR: D:\foo
----------
DESC: Drive-relative path with drive in 1st component (alt)
PATH: \D|\foo
URL: file:///D%7C/foo
FSR: D:\foo
----------
DESC: NT path with drive
PATH: \??\D:\foo
URL: file:///D:/foo
FSR: D:\foo
----------
DESC: NT path with drive (alt)
PATH: \??\D|\foo
URL: file:///%3F%3F/D%7C/foo
FSR: \??\D|\foo
----------
DESC: Drive-relative path with random first component, drive in 2nd component
PATH: \foobar\D:\something
URL: file:///D:/something
FSR: D:\something
----------
DESC: Volume name
PATH: DriveName\foo
URL: file:///C:/Users/User/code/Swift/DriveName/foo
FSR: C:\Users\User\code\Swift\DriveName\foo
----------
DESC: Drive-relative path with volume name in 1st component
PATH: \DriveName\foo
URL: file:///DriveName/foo
FSR: c:\foo
----------

(Note: The last 2 are because I gave my C drive the friendly label "DriveName" in explorer. Not only is this relatively expensive, but I hope your users don't give their volumes generic names like "Files" or "Documents"!)

My favourite one is how \foobar\D:\something gets turned in to file:///D:/something. Foundation seemed to be returning something reasonable-looking for \??\D:\foo, but I couldn't find any handling for that in the code - as it turns out, Foundation simply sees a drive letter in the second component, and just indiscriminately discards the first component :man_shrugging:

All of this is illustrative of why I think a URL library should, as much as possible, provide only an abstract model of a URL rather than getting too bogged-down automatically interpreting them. Rather than doing everything for you and being a jack of all trades, we'll learn from what I perceive to be CFURL's mistakes and delegate that to domain experts such as swift-system or the platform's native path APIs.

Fair: I hope the dual experiences can improve each other.

On that note, Iโ€™ll be filing a bug for the above today because itโ€™s certainly not right.

1 Like

So do I :slight_smile:

I'd also be happy to discuss what more could be done to ensure Swift has great URL handling. RFC-1738 is from 1994, which was the year when:

  • Apple released its first PowerPC Macintosh, transitioning from the Motorola 68K
  • The Linux kernel hit version 1.0
  • Microsoft stopped selling MS-DOS
  • The W3C was founded
  • Netscape Navigator 1.0 was released

Undoubtedly, a lot has changed since then, and the URL standards have not held up very well (Foundation itself provides URLComponents, which actually conforms to a different RFC!). I made this library because I think a new type which matches how browsers and other systems interpret URLs today would be an asset for Swift (e.g. we now exactly match Safari, as well as platforms like Node.js and Rust). CFURL's code indicates that it dates back to at least 1998 -- it's doing very well all things considered! Especially thinking about everything it has to support (file reference URLs, security-scoped URLs, URLResourceKeys, etc). I'm sure there are things which might be done differently if it was being written today, and I'd certainly value that insight. There are not many teams who have maintained such a load-bearing URL library for such a long time.

The goal of this library is not to diminish Foundation, but to empower Swift. After file URLs, the next thing I'll be working on is interoperability with Foundation APIs like URLSession, which is likely to be even more challenging (if it's even possible with existing APIs), but I think it would add value to this library, Foundation, and Swift as a whole.

6 Likes
Terms of Service

Privacy Policy

Cookie Policy