SE-0529: Add FilePath to the Standard Library

This is the justification I was missing in the proposal, thank you. I think it would be good to have most of this in the standard docs for FilePath as well, so that when someone like me asks "what is all this nonsense" they can scroll back up and see "oh, that's what".

(I'm still not fully convinced it's a model that's going to be any more useful in practice, but it does explain where the choices came from.)

1 Like

I worry here a bit about the basic API going too deep into platform-specific rabbitholes. In the back of my mind, I keep thinking of Common Lisp's pathname type, which ended up an overengineered pile of special fields trying to accommodate the quirks of every 60s and 70s file system, with the paradoxical result that the type itself became a portability hazard because implementations could not agree on how to map path concepts from newfangled operating systems like DOS and Mac OS into fields of the standardized type. This proposal is a far way away from that, and the world of path designs has thankfully collapsed quite a bit these days. To me, the proposed design's "anchor" field, with the definition as the level below which you can no longer "..", seems like a decent lowest common denominator for the "universal" non-platform-specific part of the interface, since it usefully covers Windows drive letters, UNC paths, and magic \\.\/\\?\ prefixes, special Darwin path prefixes, and most of those cases you listed out as well, with a well-defined rule for what it means.

7 Likes

Case in point, “device name” is exactly the wrong name to use on Windows. That would be PhysicalDisk0, which is the entry in the \Devices section of the object manager that the \Global??\C: symlink points to.

2 Likes

I'd be inclined to call PhysicalDisk0 the "NT device name", as opposed to "C:" which is the "DOS device name". My point was solely that lots of systems have a device name; it's only really CP/M family operating systems (including DOS and Windows) that have a "drive letter".

(I remain surprised that Microsoft hasn't added support for extended DOS device names and is therefore still restricted to 26 drive letters. Systems with 26 separate drives are I suppose less common, though Windows environments commonly take up drive letters for network shares as well, so it's fairly easy to hit the limit.)

However, @Joe_Groff makes a good point that maybe the anchor property already covers this kind of thing, and I certainly wouldn't want to end up with the problem Common Lisp had.

1 Like

NUL, COM1 are also DOS device names. Technically anything in \\.\DosDevices\ would be a DOS device name. In that model, C: is the device name, which makes it annoying as I believe that Win32 doesn't model this in the same manner and treats C as the drive letter, : as the drive separator, and then a drive relative path. However, I do believe that Anchor is the right model.

1 Like

Hi all, I've refreshed the proposal PR with the latest round of feedback: swiftlang/swift-evolution#3292.

Brief change list:

  • Anchor.driveLetter is now Unicode.Scalar? (was Character), and both driveLetter and isVerbatimComponent are now Windows-only (#if os(Windows)). Unicode.Scalar is the narrowest candidate still expressible as and comparable against a literal (anchor.driveLetter == "C") and converts cleanly to both FilePath.CodeUnit and String. The letter is presented as written, without case normalization; an unpaired surrogate yields U+FFFD. A new Alternatives Considered subsection covers the reasoning.
  • FilePath.separator is now FilePath.CodeUnit (was Character), since it's a platform-level constant more likely to feed byte-level parsing than literal comparison.
  • The closure-based C-interop accessor is renamed withCodeUnits(_:) and now passes the code-unit count excluding the NUL terminator (thanks @glessard). nullTerminatedCodeUnits is retained as a span alongside it.
  • New "Path syntax versus filesystem interpretation" subsection in Path decomposition, covering the kernel-parses-syntax versus driver-interprets-opaque-bytes split that @jrose and @Joe_Groff asked about, contrasting .zfs/snapshots/<id> against Darwin's /.nofollow/.
1 Like

A question that comes to mind based on these (reasonable) changes is:

If Unicode.Scalar is ergonomic and converts cleanly to both CodeUnit and String, then could FIlePath.separator also be a Unicode.Scalar, allowing us not to have to choose its type based on what it's "likely" to be used for?

So, for clarity, the code-unit count passed to the closure for withCodeUnits(_:) will be exactly one less than nullTerminatedCodeUnits.count? (Very reasonable, just want to state the invariant out loud.)

driveLetter is a dynamic property that is queried from a path and likely to be switched over, compared, or printed/dumped out. There's very high value to having it in a string-convertible type and it's less likely someone will be using that property for byte splicing. If they do want to do byte splicing or low-level parsing, they can do numericCast(scalar.value) to get the FP.CodeUnit (though since it only applies on Windows they may even just use UInt16 directly).

In contrast, FilePath.separator is a static member determined by the platform and doesn't vary based on the specific path instance. It is much more likely to be used in byte splicing or the other very low-level interfaces. So we are proposing it be FilePath.CodeUnit directly.

Correct. It is strlen without having to call strlen.

2 Likes

If we're abandoning the withCString conceit, why not just use a buffer pointer then?

1 Like

I thought I should clean up and write down my thoughts on why I think it's important not to add resolve: In the thread it appears like there is a security requirement to have resolve.

Bottom line is that I'm very surprised if a security team would ask for the inclusion of resolve. My opinion is that almost all uses of resolve will result in a security vulnerability. Why: The result of resolve is pointless because it's a snapshot at a point in time, if you use it for anything at all, it may be wrong already because the world of the file system will have changed underneath.

Hence, I would recommend to not include resolve on accounts of two reason:

  1. Security: Using resolve is almost always a bad idea IMHO. Let's not include it as the only API we have to talk to the file system
  2. API: FilePath advertises itself as a syntactical type. I love that and this is IMHO the right choice. But then, it has resolve. Even if resolve were fine from a security point of view, I would still be against adding it to FilePath: It's a file system operation and this needs to be clear to a user. They won't expect this to involve the file system. Worse, file systems are one of the most commonly injected/mocked pieces of code (even clang has VFS etc). So if we think (I don't but let's ignore that) that resolve should exist at all, it should be on something like FileSystem.shared.resolve(path) (where FileSystem.shared would be the actual file system). For software that supports VFSes like clang, you'd have func doStuff(fileSystem: FileSystem) { fileSystem.resolve(path) } (i.e. accept an injected fileSystem).

I think both of these important concerns. And IMHO the solution is very straightforward: Let's add FilePath as a purely syntactical type ASAP. Then, in a separate proposal, we can discuss APIs that touch the file system, they all have tons of their own complexity (sync vs. async; real FS vs. injected VFS; ...) that ought to not be hidden in the middle of a syntactical FilePath proposal.

13 Likes

So, it would be reasonable to remove support for collapsing empty, parent, self components in the path? i.e. you could not take /.././/usr/bin and have it form /usr/bin.

To clarify, resolve using the filesystem is the result of Michael collaborating with security people. The original Path had a lexical-only resolve, which is a pattern that causes security issues. It would be a security disservice to encourage people to use a lexical-only resolve for operations ultimately targeting the filesystem.

2 Likes

The basic question is should such an API exist, regardless of whatever design or spelling it takes. The unambiguous answer from Apple's security team is "yes".

The opinion that path resolution is insecure is not one that Apple's security team shares. Quoting them directly:

Resolving a path is a required part of a symlink TOCTOU vulnerability fix for path-based code. Furthermore, by having this resolve method in the standard library, it also presents the opportunity for platforms that do support the /.nofollow path prefix to enforce the symlink-following-prevention part of the TOCTOU fix robustly by default, regardless of whether the developer remembers to use the O_NOFOLLOW_ANY-equivalent flag on all uses of the resolved path.

It is true that TOCTOU concerns are inherent to paths, as it is to any form of shared mutable state (such as comparing two integers loaded from mutable memory). Not having a resolve API means developers will have to do it themselves, and they almost certainly will do so incorrectly or sub-optimally, possibly with issues above and beyond any inherent TOCTOU concerns.

The second question is how it is presented or spelled. I.e. a method on FilePath vs a namespace or some other form.

In reading your summary, I want to make sure we address these questions separately and not conflate our reasoning. Are you opposed to the existence of path resolution API, or are you arguing against the design/spelling/presentation of that API?

2 Likes

I don’t object to the usefulness of resolving a file path against the file system, that is functionality that is commonly used and there should be a high-quality, standard solution instead of ad-hoc implementations.

The simple resolve() name is part of the confusion in this discussion. Both file-system-level and lexical resolution have their use cases and different people will assume different behavior when reading that name.

The method’s proposed placement on a type that does otherwise not interact with the file system favors the lexical reading while the required try favors the file system reading. Both of those are weak signals. I think that if this method is added to FilePath, it needs to be clear to the reader (and autocomplete accepter) that file system access is happening.

I dislike URL as I cannot remember which API is accessing the file system and which is not. Instead, FilePath without resolve() is pure and easy to think about.

4 Likes

I am curious what classes of TOCTOU bugs are addressed by a resolve() operation that returns a path instead of a file descriptor. Should these clients really be doing a resolveAndOpen()?

1 Like

So, it would be reasonable to remove support for collapsing empty, parent, self components in the path? i.e. you could not take /.././/usr/bin and have it form /usr/bin.

I don't think I would need this collapsing functionality, neither syntactical nor resolved.

Resolving a path is a required part of a symlink TOCTOU vulnerability fix for path-based code. Furthermore, by having this resolve method in the standard library, it also presents the opportunity for platforms that do support the /.nofollow path prefix to enforce the symlink-following-prevention part of the TOCTOU fix robustly by default, regardless of whether the developer remembers to use the O_NOFOLLOW_ANY-equivalent flag on all uses of the resolved path.
It is true that TOCTOU concerns are inherent to paths [...]

Right. The actual fix on all platforms is to not introduce a TOCTOU by not pre-resolving a path in the first place.

Now, for code that does suffer from the pre-resolution TOCTOU but wants to not on top of that suffer from the symlink-TOCTOU may if they happy to be on very new Darwins (I think it's macOS Tahoe 26.0+) use resolve + /.nofollow/.

To me, this is a pretty weak story: It's a very partial 'fix' that works on new Darwins only and will fall flat on its face on all other platforms. Now, if we pitched to add this functionality into the Darwin (or even swift-system) module where a (strawman) FileSystem.resolveAndStopFollowingSymlinks(path) which would resolve AND automatically prepend /.nofollow, then I think we got an argument here. Whereas the (arguably more cumbersome) NO_FOLLOW[_ANY] works on all platforms. But we don't even have APIs to open files in the stdlib, so the point is kinda moot.

But adding this functionality which is dangerous on all platforms (but can be used in a slightly less dangerous way on new Darwins) into the standard library is just not right IMHO.

Also: This would be the very first API to do anything with a file system in the standard library. So I think any security arguments on this are misguided. In fact we are making it look like it is a good idea to resolve file paths -- it's not, it's racy. If you need it secure, you've got to open the file/directory (APIs we don't have).

In reading your summary, I want to make sure we address these questions separately and not conflate our reasoning. Are you opposed to the existence of path resolution API, or are you arguing against the design/spelling/presentation of that API?

  • I'm opposed to the spelling (i.e. that it's on FilePath)
  • Separately, I'm opposed to starting with this API. It may have its place but reading/writing files/directories etc feels 100x more important than resolving a path (again, not a good idea)
  • Thirdly, all of the file system operations are a separate concern. They should be discussed in a holistic way in a separate proposal. How can we discuss the security of doing anything with file paths without even having any APIs that do anything with file paths?
10 Likes

I dislike URL as I cannot remember which API is accessing the file system and which is not. Instead, FilePath without resolve() is pure and easy to think about.

This is a very important argument and one that I tried to make too, albeit less precise. The distinction between a syntactical resolve and a file system resolve is really important, the APIs should make this clear. "/tmp/foo".resolve() just is not clear. On a type that describes itself as

We propose adding FilePath and its essential operations to the Swift standard library. FilePath parses platform-specific path syntax on the developer's behalf, provides a normalized view of path components, and enables resolution against the filesystem.

which to me sounds very syntactic, "/tmp/foo".resolve() calling down to the file system, potentially blocking the calling thread, is misleading.

3 Likes

This functionality has long existed in Foundation as -[NSString stringByStandardizingPath]. Path normalization is a very common and critical security operation in e.g. webservers to prevent path traversal beyond a certain root. OWASP recommends using a library’s normalization routines when implementing such security logic. Even the library authors can goof it up, like the Apache project did when they failed to interpret “creative” encodings of path segments like ../, thus introducing a path traversal vulnerability.

So while you personally may never need this functionality, plenty of other people will, and they shouldn’t be rolling their own implementations.

3 Likes

This functionality has long existed in Foundation as -[NSString stringByStandardizingPath].

I'm aware, are you arguing this is good precedent? I find it to be a good argument on why we must not have this.

(Not important but stringByStandardizingPath deliberately partially unresolves a path. On Darwin, /tmp is a symlink to /private/tmp. But stringByStandardizingPath will turn /private/tmp/foo into /tmp/foo which means that after stringByStandardizingPath, a /.nofollow/ will actually no longer work. I think stringByStandardizingPath is mostly a path prettification function, not a path resolution function.)

Path normalization is a very common and critical security operation in e.g. webservers to prevent path traversal beyond a certain root.
OWASP recommends using a library’s normalization routines when implementing such security logic.

Very much correct. However, a web server will need to do much more than just calling a resolveViaFileSystem. A web server should utilise a good FileSystem implementation to ring fence accesses.

For example, an API I could imagine is

try await FileSystem.shared /* <- real FS */.withRingFencedFileSystem(rootedAt: "/var/www") { ringFencedFS in
    try await withWebServer { web in
        try await web.handle("/download-file/**") { request in
            try await web.writeAsyncSeq(
                // `ringFencedFS` automatically makes sure that `/foo` means `/var/www/foo` and that it's impossible to escape `/var/www`
                try await ringFencedFS.readChunks(request.path.chop("download-as"))
            )
        }
    }
}

Yes, withRingFencedFileSystem may internally use FileSystem.resolvePath(...) and also leverage /.nofollow/ on Darwin and/or NO_FOLLOW[_ANY] but I don't think it's a particularly good argument to say "Web servers really need FilePath.resolve". Yes, they need to take care of these vulnerabilities. No, FilePath.resolve is neither enough, nor what web servers actually want (they need it non-blocking, they need it to work on all OSes and the need to properly ring-fence their /var/www root and resolve is only a small building block for that).

7 Likes

I just want to echo that I couldn't have phrased it better than @johannesweiss here. I personally believe that we should remove the resolve methods from this proposal so it becomes focused on just providing a currency type for file paths. I see a lot of value in just adding a currency file path type that libraries can adopt in their public APIs across the ecosystem.

While I fully understand the need for better file system and file APIs in Swift, I deeply believe that this should be separated out and requires a larger, more holistic view that tackles the security-related problems brought up here and the async/sync and blocking/non-blocking questions.

10 Likes