Pitch: Add FilePath to the standard library

A safe API ultimately requires coordination with the kernel. XNU supports this, but not Linux.

If FilePath on Linux were to store the user's intent that a path stay resolved, and Swift code checks this before issuing syscalls, we might be able to achieve a sort of pseudo-effectively-safe situation, but this is always going to be hairy and incomplete. I.e., Linux Swift code that makes syscalls could check that bit and use O_NOFOLLOW_ANY when applicable, or else try to approximate it via some kind of pseudo-atomic realpath use (i.e. check before and after the syscall), which wouldn't be 100% TOCTOU safe.
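To make the "check before and after the syscall" idea concrete, here is a minimal sketch using only realpath(3) and open(2). The helper names fullyResolved and openIfStillResolved are hypothetical and belong to no existing API. As noted above, this only narrows the race window; it does not close it.

```swift
#if canImport(Darwin)
import Darwin
#elseif canImport(Glibc)
import Glibc
#endif

/// Hypothetical helper: the fully resolved spelling of `path`, or nil on error.
func fullyResolved(_ path: String) -> String? {
    guard let buffer = realpath(path, nil) else { return nil }
    defer { free(buffer) }          // realpath allocated the buffer for us
    return String(cString: buffer)
}

/// Hypothetical helper: open `path` read-only only if it is already spelled
/// in its resolved form both before and after the open call.
/// A swap that happens and is undone between the two checks goes unnoticed,
/// so this is an approximation and not TOCTOU safe.
func openIfStillResolved(_ path: String) -> Int32? {
    guard fullyResolved(path) == path else { return nil }
    let fd = open(path, O_RDONLY)
    guard fd >= 0 else { return nil }
    guard fullyResolved(path) == path else {
        close(fd)
        return nil
    }
    return fd
}
```

A caller would treat a nil result as "the path was not (or did not stay) resolved" and fall back to whatever policy the application wants.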

I say "pseudo-effectively-safe" in the same sense that you can safely use unsafe pointers if you're very careful, and everyone else who touches your code is very careful, and any libraries you call with that pointer are very careful, etc., all in perfect coordination. But even then, someone could scribble over memory with a different unsafe pointer.

This isn't a perfect analogy, as file system mutation can happen via code that is normally considered "safe", and a timing attack would still be possible in theory.

1 Like

Broadly speaking: making guarantees about safety that are based on platform-specific features puts one in the awkward situation of having to say "we are making safety guarantees in part of the API on this platform, but these guarantees no longer apply if you're using the language on another platform". Ideally, a language-based security guarantee would be something that can be achieved wherever the language is used.

One big difference between Linux and macOS is that macOS has in-band signaling for path opening options (i.e., by using a special prefix, the path itself can specify whether symlinks and .. should be followed). If you take a path out of a FilePath and pipe it through to a POSIX API, on macOS, these options can be encoded in the path itself and preserved this way, whereas they can't be on Linux. Linux still supports openat and O_BENEATH/O_RESOLVE_BENEATH, but you're at the mercy of the person doing the syscall.

One danger of relying on in-band signaling is that somebody else's mishandling can defeat it. For instance, /.resolve/2/path asks xnu to not follow .. components when resolving path, so conceivably Swift could prepend that to paths that are being passed off to C/C++/ObjC APIs if that's the desired behavior. But if that code lexically canonicalizes .. components while unaware of xnu's roots, it could still lexically canonicalize /.resolve/2/tmp/unzip/../../../../etc/passwd to /etc/passwd.
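As a rough illustration of that failure mode, here is a sketch of the kind of naive lexical canonicalizer a third-party library might contain. The function name naiveLexicalCollapse is made up for this example, and it assumes an absolute POSIX-style path.

```swift
/// Hypothetical third-party helper: collapses "." and ".." purely lexically,
/// with no knowledge that "/.resolve/2" is a special prefix on xnu.
/// Assumes `path` is absolute.
func naiveLexicalCollapse(_ path: String) -> String {
    var stack: [Substring] = []
    for part in path.split(separator: "/") {   // empty components are dropped
        if part == "." { continue }
        if part == ".." {
            if !stack.isEmpty { stack.removeLast() }
        } else {
            stack.append(part)
        }
    }
    return "/" + stack.joined(separator: "/")
}

// The four ".." components eat "unzip", "tmp", "2", and ".resolve".
let collapsed = naiveLexicalCollapse("/.resolve/2/tmp/unzip/../../../../etc/passwd")
print(collapsed)
```

The in-band prefix is just another pair of components to code like this, so it gets popped along with everything else.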

EDIT: after an offline conversation I realized it's probably relevant to FilePath that the in-band resolve flags only work for absolute paths.

1 Like

Here’s a counterpitch: Instead of adding FilePath, what about adding a file path protocol to the standard library? System.FilePath could conform to it, and so could Foundation.URL, file reference NSURLs, internal virtual paths inside that complex file type you’re parsing, and that cross-platform C library you pulled in that happened to define its own file path type (conveniently named Florglequat). Methods that take FilePath could take the protocol instead, we could add shims in the Foundation overlay so that all the APIs that take file URLs could take the protocol instead, and then you could pass paths you got from one set of APIs to another set of APIs without constantly having to do things like FilePath(fileURL.path).
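To give a feel for the shape of this, here is a minimal sketch. The protocol name FilePathLike, its single pathBytes requirement, and the stand-in types are all invented for illustration; a real design would need far more thought about encodings and platform strings.

```swift
/// Hypothetical protocol: anything that can produce the bytes of a path.
protocol FilePathLike {
    var pathBytes: [UInt8] { get }
}

/// Stand-in for a FilePath-like currency type (not System.FilePath itself).
struct SketchFilePath: FilePathLike {
    var pathBytes: [UInt8]
    init(_ string: String) { pathBytes = Array(string.utf8) }
}

/// Stand-in for a third-party path type that stores its own representation.
struct Florglequat: FilePathLike {
    var segments: [String]
    var pathBytes: [UInt8] {
        Array(("/" + segments.joined(separator: "/")).utf8)
    }
}

/// A library function written once against the protocol, with no overloads.
func lastComponent<P: FilePathLike>(of path: P) -> String? {
    path.pathBytes
        .split(separator: UInt8(ascii: "/"))
        .last
        .map { String(decoding: $0, as: UTF8.self) }
}

print(lastComponent(of: SketchFilePath("a/b/c.txt")) ?? "nil")
print(lastComponent(of: Florglequat(segments: ["x", "y"])) ?? "nil")
```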

It would certainly be a boon to library developers, as you wouldn’t have to add separate overloads of every path-using method taking every possible kind of file path someone might want to pass in, and you wouldn’t have to pull in both swift-system and Foundation to provide said support.

1 Like

I spent an unreasonable amount of time (and tokens) grappling with what FilePath should do with respect to trailing slashes as well as lexical normalization. I thought I'd share some findings so far.

Here's a broad language survey that covers C++17, Rust, and Python's pathlib in great detail, as well as Go, Java NIO, Node.js, Zig, and Haskell.

Here are the synthesized tables providing an overview by area.


Based on this research, I'm proposing some changes from the current swift-system behavior. This is for POSIX paths; Windows will be addressed separately.

Trailing separators

I think that FilePath should preserve a trailing slash if given a trailing slash. Trailing slashes carry meaning in rsync, .gitignore, shell completion, and other application-level conventions. As for stat, a FilePath instance shouldn't destroy information that downstream consumers need without explicit API calls.

Python's pathlib strips trailing separators on construction, destroying the information, and this is largely seen as a mistake that they've been unable to walk back. Swift-system currently strips trailing separators the same way. This is the last opportunity to change that before standard library ABI commitments make it more permanent.

Rust stores verbatim bytes, so the trailing separator was never lost. The nightly has_trailing_sep / with_trailing_sep / trim_trailing_sep APIs add ergonomic accessors for information that was always there but previously required raw byte inspection. Haskell has had equivalent first-class APIs (hasTrailingPathSeparator, addTrailingPathSeparator, dropTrailingPathSeparator) from the start. I think we should provide the same three APIs.
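For concreteness, the three operations might look roughly like this on a toy String-backed type. TrailingSepPath and the member names are placeholders rather than proposed spellings, and the treatment of the root and the empty path is an assumption made for this sketch.

```swift
/// Toy stand-in for FilePath, backed by a String, POSIX separators only.
struct TrailingSepPath: Equatable {
    var raw: String

    /// True for "foo/", false for "foo". Assumption: the root "/" is not
    /// reported as having a trailing separator, since it cannot be dropped.
    var hasTrailingSeparator: Bool {
        raw.count > 1 && raw.last == "/"
    }

    /// Adds a separator if there is none. Assumption: the empty path is left
    /// alone, because "" + "/" would silently become the root.
    func addingTrailingSeparator() -> TrailingSepPath {
        if raw.isEmpty || raw.last == "/" { return self }
        return TrailingSepPath(raw: raw + "/")
    }

    /// Removes a single trailing separator, never touching the root.
    func droppingTrailingSeparator() -> TrailingSepPath {
        hasTrailingSeparator ? TrailingSepPath(raw: String(raw.dropLast())) : self
    }
}

let dir = TrailingSepPath(raw: "foo/bar/")
print(dir.hasTrailingSeparator, dir.droppingTrailingSeparator().raw)
```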

We would still normalize repeated separators on construction (a///b to a/b), since that's purely an encoding-level concern with no semantic content. FilePath is a COW type, meaning it will be copying the bytes over, so it might as well do semantics-preserving normalization that preserves path algebra and speeds up component iteration, equality, and hashing.
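A sketch of what construction-time coalescing could look like for POSIX-style paths, operating on a plain String for brevity. The function name is invented, and collapsing a leading // down to / is a choice made for this sketch.

```swift
/// Collapses each run of "/" into a single "/", leaving everything else,
/// including a trailing separator, untouched. POSIX-style paths only.
func coalescingRepeatedSeparators(_ path: String) -> String {
    var result = ""
    var previousWasSeparator = false
    for character in path {
        if character == "/" {
            // Keep only the first separator of each run.
            if !previousWasSeparator { result.append(character) }
            previousWasSeparator = true
        } else {
            result.append(character)
            previousWasSeparator = false
        }
    }
    return result
}

print(coalescingRepeatedSeparators("a///b"))    // interior run collapsed
print(coalescingRepeatedSeparators("a//b//"))   // trailing separator survives
```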

That being said, this raises a question about substitutability: what should Equatable and Hashable mean?

Equality

I weakly believe FilePath("foo/bar") and FilePath("foo/bar/") should compare equal and hash identically, but their literal bytes should differ as one stores a /. This means that insertion and retrieval in a Dictionary can change the presence of a trailing slash, and that will be confusing, but less confusing than the reverse.

Code that needs to distinguish the two forms can use an explicit hasTrailingSeparator API. Similarly, code doing niche symlink handling can explicitly add or remove the trailing separator as needed.

This is, I believe, the right default for a currency type: the common case (these name the same thing) should be easy, and the uncommon case (I care about the trailing separator specifically) should be possible. But, this is obviously a place where there's no perfect solution, we're just trying to find the right solution for Swift.
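Here is a small sketch of that equality model and the Dictionary behavior it implies. LoosePath is a toy String-backed type invented for this example.

```swift
/// Toy path type whose equality and hashing ignore one trailing separator,
/// while the stored spelling keeps it.
struct LoosePath: Hashable {
    var raw: String

    /// The spelling used for comparison: `raw` minus a trailing "/".
    private var comparisonKey: Substring {
        (raw.count > 1 && raw.last == "/") ? raw.dropLast() : raw[...]
    }

    static func == (lhs: LoosePath, rhs: LoosePath) -> Bool {
        lhs.comparisonKey == rhs.comparisonKey
    }

    func hash(into hasher: inout Hasher) {
        hasher.combine(comparisonKey)
    }
}

var table: [LoosePath: Int] = [:]
table[LoosePath(raw: "foo/bar/")] = 1

// Lookup with the other spelling succeeds...
print(table[LoosePath(raw: "foo/bar")] ?? -1)
// ...but the key that comes back out carries the slash from insertion time.
print(table.keys.first!.raw)
```

The last line is the confusing part mentioned above: the spelling you get back from the Dictionary need not be the spelling you looked up with.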

Normalization

I've also come to believe that lexicallyNormalized() should not resolve .. by default, at least for POSIX-style paths.

Languages are split on this. Rust's component iterator and Python's pathlib both preserve "..". Go's filepath.Clean and C++17's lexically_normal() (lexically) resolve it. I think preserving .. is the right default for Swift: lexical .. resolution is only correct in the absence of symlinks, and a method that silently gives the wrong answer in their presence is an attractive nuisance.

I think we should still offer .. resolution, but as an opt-in rather than the default sense of "normal". Similarly, normalization could preserve the trailing separator by default (consistent with construction). For example, perhaps we instead have (but with a better name):

func normalize(dropTrailingSeparators: Bool = false, lexicallyResolveDotDot: Bool = false)

Note: As I said earlier, I believe we must ship a full resolution API alongside any lexical normalizing API that handles .., and give the full resolution function the better name.
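To show how the two flags might interact, here is a sketch written as a free function over String. The parameter labels come from the signature above; everything else, including how an empty result and a leading .. are handled, is an assumption of this sketch.

```swift
/// Sketch of opt-in lexical normalization for POSIX-style paths.
/// Always drops "." components and repeated separators.
func normalize(
    _ path: String,
    dropTrailingSeparators: Bool = false,
    lexicallyResolveDotDot: Bool = false
) -> String {
    let isAbsolute = path.first == "/"
    let hadTrailing = path.count > 1 && path.last == "/"
    var kept: [Substring] = []
    for part in path.split(separator: "/") {
        if part == "." { continue }
        if part == ".." && lexicallyResolveDotDot {
            if let last = kept.last, last != ".." {
                kept.removeLast()        // "a/.." cancels out
                continue
            }
            if isAbsolute { continue }   // "/.." is treated as "/"
        }
        kept.append(part)                // includes leading ".." of relative paths
    }
    var result = (isAbsolute ? "/" : "") + kept.joined(separator: "/")
    if result.isEmpty { result = "." }   // assumption: empty result becomes "."
    if hadTrailing && !dropTrailingSeparators && result.last != "/" {
        result += "/"
    }
    return result
}

print(normalize("a/./b/../c/"))
print(normalize("a/./b/../c/", lexicallyResolveDotDot: true))
print(normalize("a/./b/../c/", dropTrailingSeparators: true, lexicallyResolveDotDot: true))
```

With both flags left at their defaults, the only things removed are . components and repeated separators, which keeps the default free of the symlink hazard.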

Question: Drop interior or trailing . on construction

A related question is whether construction should also drop interior . components. Currently swift-system preserves them: FilePath("foo/./bar") stores the . as a component. Rust's component iterator (upon which equality and hashing are based) silently skips interior . (but not leading .), and Python removes them on construction.

There's an argument for stripping them eagerly: . components have no semantic content in a path (unlike .., which does), and removing them at construction simplifies components, iteration, and comparison without losing information. If we're copying the bytes over anyways, and also normalizing repeated separators (a//b -> a/b), then this is a good time to drop the dots.

If we do strip . on construction, note that FilePath.ComponentView's RRC append operations could reintroduce them as part of its path algebra. We still want explicit normalization API, and one could argue that normalizing the . is now done defensively if you don't know where a path came from. If we don't strip ., we have a more consistent normalization story for ., but we're encouraging developers to call that normalization function in more places in code. At this point, we'd probably establish a term like "canonical" instead of "normal" for on-construction operations.

My very weak current opinion is that we keep interior . on construction (like C++17 and Rust) but treat it as an actual component for iteration, equality, and hashing (like C++17 and existing swift-system, but unlike Rust). This preserves the path algebra: RRC operations don't need to worry about silently reintroducing something that construction would have removed. Explicit normalization is then the tool for cleaning up . components, whether they came from construction or from mutations.

I'd like to hear the community's thoughts on where the right line is.

8 Likes

I don’t see how this follows. Can you elaborate more on why you believe Rust’s approach is incorrect here? In the underlying implementation, the path is a bag of bytes passed to the filesystem for parsing. Rather than make convenient assumptions that vary by platform and may be invalidated by future updates to the underlying OS, why not do the simple and predictable thing by default and expose normalization APIs for cases where users wish to normalize according to the conventions of the platform they are currently running on?

1 Like

This is great! As an initial reaction based on these tables, my inclination would be to agree that behavior on initialization helps sidestep some of the questions of later behavior.

One of the strengths of Swift is that we can have labeled initializers that provide alternative behavior, allowing any user to specify their intent without assumptions. Therefore, an alternative design which (to me) appears to be cleaner would be:

/* Multiple initializers */
init(verbatim:)               // Nothing normalized, like other languages with verbatim storage
init(lexicallyNormalizing:)   // Full lexical normalization
// Possibly a third init which is closer to current swift-system behavior;
// and/or provide additional defaulted parameters on the API above (and below)

var lexicallyNormalized: Self // A copy of the current path, lexically normalized

Equality and hashing could be based on raw bytes thereafter: A user who chooses to construct only lexically normalized paths—or, obtaining any path from a third-party API, uses a lexicallyNormalized copy—would have equality and hashing based on normalized paths exactly as desired. A user who chooses to use only verbatim paths would have verbatim comparisons.
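A sketch of how that could fit together, using a toy byte-backed type. LabeledPath is a made-up name, "full lexical normalization" here means dropping ".", resolving ".." and dropping trailing separators (an assumption), and going through String is a simplification that a real type handling non-UTF-8 paths could not make.

```swift
/// Toy path type: what you store is what gets compared and hashed.
struct LabeledPath: Hashable {
    private(set) var bytes: [UInt8]

    /// Stores the input exactly as given.
    init(verbatim string: String) {
        bytes = Array(string.utf8)
    }

    /// Stores the lexically normalized form of the input.
    init(lexicallyNormalizing string: String) {
        bytes = Array(LabeledPath.normalize(string).utf8)
    }

    /// A lexically normalized copy of this path.
    var lexicallyNormalized: LabeledPath {
        LabeledPath(lexicallyNormalizing: String(decoding: bytes, as: UTF8.self))
    }

    /// Assumed normalization rules for POSIX-style paths (see lead-in).
    private static func normalize(_ path: String) -> String {
        let isAbsolute = path.first == "/"
        var kept: [Substring] = []
        for part in path.split(separator: "/") where part != "." {
            if part == "..", let last = kept.last, last != ".." {
                kept.removeLast()
            } else if part == ".." && isAbsolute {
                continue
            } else {
                kept.append(part)
            }
        }
        let joined = (isAbsolute ? "/" : "") + kept.joined(separator: "/")
        return joined.isEmpty ? "." : joined
    }
}

let asWritten = LabeledPath(verbatim: "a//b/./c")
let cleaned = LabeledPath(lexicallyNormalizing: "a/b/c")
print(asWritten == cleaned, asWritten.lexicallyNormalized == cleaned)
```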

I'd agree with going with the majority in the survey wrt basename, dirname, and join behavior here, as long as the explicit hasTrailing..., addTrailing... and dropTrailing... APIs are available.

7 Likes

One annoyance of the current System FilePath that I would like to see changed in the standard library is the way the value is printed when debugging. For example, this is what a FilePath currently looks like in a REPL:

  1> import System
  2> FilePath("some/file/path")
$R0: System.FilePath = {
  _storage = {
    nullTerminatedStorage = 15 values {
      [0] = {
        rawValue = 115
      }
      [1] = {
        rawValue = 111
      }
      [2] = {
        rawValue = 109
      }
      [3] = {
        rawValue = 101
      }
      [4] = {
        rawValue = 47
      }
      [5] = {
        rawValue = 102
      }
      [6] = {
        rawValue = 105
      }
      [7] = {
        rawValue = 108
      }
      [8] = {
        rawValue = 101
      }
      [9] = {
        rawValue = 47
      }
      [10] = {
        rawValue = 112
      }
      [11] = {
        rawValue = 97
      }
      [12] = {
        rawValue = 116
      }
      [13] = {
        rawValue = 104
      }
      [14] = {
        rawValue = 0
      }
    }
  }
}
3 Likes

This has actually recently been fixed, here.

2 Likes

Interesting, thanks for sharing. Although it appears that the bytes are still listed (expanded formatter). That’s annoying when printing a path on the debug console and then having to scroll back up to read the actual path (especially for long paths like $TMPDIR/$UUID/$BUNDLE_ID/$RESOURCE_ID). Or is the collapsed formatter used by default?

How would you describe what FilePath.== does at a high level? It's not checking if two paths open the same file, and it's not checking if two paths are lexically equivalent. As far as I can tell, it's checking if two paths are spelled identically. In that context, I think the correct behavior regarding trailing separators is that they should factor in.

5 Likes

I would say that this new proposal is very close to what I would like to see. The one thing that I think remains is already called out as "but with a better name". I would think that even func normalize(options:) might be a better spelling.

After a few iterations, the current PR prints the rendered string and the (horizontally printed) _storage byte array, and does not show the internal (vertical) list of bytes, e.g.:

(lldb) p utf8Path
(System.FilePath) "/hëllo/wôrld" {
  _storage = ['/', 'h', 0xC3, 0xAB, 'l', 'l', 'o', '/', 'w', 0xC3, 0xB4, 'r', 'l', 'd', 0x00]
}

If you have feedback please feel free to comment in the PR!

2 Likes

Would it be reasonable to just render as a string and escape "special characters" in that rendering, or do you feel that it's important to always call out the exact byte encoding used since that might matter to the OS?

Previously, I had it show just a string with any non-(printable ASCII) characters escaped, e.g.:

utf8Path = "/h\x{C3}\x{AB}llo/w\x{C3}\x{B4}rld" System.FilePath

The downsides of this approach were mainly 1) the lack of readability if a user just wants to see the string and 2) other characters like "\" then require escaping, which makes all common Windows paths read like "C:\\hello\\world".

Being able to see the actual bytes is important for debugging because NFC vs. NFD form could refer to different file names and cause errors, for instance. Also, the lldb string rendering appears to omit control characters like \x{01} or \x{02}, so only the byte view would show that sort of data corruption.
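A quick way to see the NFC vs. NFD point in plain Swift, with no file system involved. The example word is arbitrary.

```swift
// The same user-visible text in two Unicode normalization forms.
let nfc = "caf\u{E9}"        // precomposed é (NFC)
let nfd = "cafe\u{301}"      // e + combining acute accent (NFD)

// Swift String comparison uses canonical equivalence, so these are "equal"...
print(nfc == nfd)

// ...but the bytes that would be handed to the OS are different.
print(Array(nfc.utf8))
print(Array(nfd.utf8))
```

A rendered string alone would hide exactly the difference that matters to a file system that compares names byte by byte.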

All-in-all, showing both the rendered string and a (horizontal) view of the bytes if needed seemed like the best approach.
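For anyone who wants the same horizontal byte view outside of lldb, something like the following would approximate it. This is only a sketch and not the formatter from the PR; renderBytes is a made-up name, and quote and backslash characters are not escaped.

```swift
/// Renders bytes on one line: printable ASCII as quoted characters,
/// everything else as two-digit uppercase hex.
func renderBytes(_ bytes: [UInt8]) -> String {
    let items = bytes.map { byte -> String in
        if (0x20...0x7E).contains(byte) {
            return "'\(Character(UnicodeScalar(byte)))'"
        }
        let hex = String(byte, radix: 16, uppercase: true)
        return "0x" + (hex.count < 2 ? "0" + hex : hex)
    }
    return "[" + items.joined(separator: ", ") + "]"
}

// UTF-8 bytes of "/hë" followed by the null terminator.
print(renderBytes(Array("/h\u{EB}".utf8) + [0]))
```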

3 Likes

Rust is a different language than Swift and its path type is a borrowed view into read-only memory. Swift's FilePath type is a COW-managed object that owns its own copy of the path data. If Rust were to do canonicalization/normalization on construction inside its storage type, it wouldn't help them because the view type (where the API lives) does not necessarily point to their storage type. Instead, they have to canonicalize/normalize as part of every read.

An analogous operation in Rust could be:

let normalized: PathBuf = path.components().collect();

Using nightly Rust:

#![feature(path_trailing_sep)]
let had_trailing = path.has_trailing_sep();
let mut normalized: PathBuf = path.components().collect();
if had_trailing {
    normalized.push_trailing_sep();
}

Rust stored verbatim bytes, lost trailing separator information through the iterator, and is now adding nightly API to recover it. I'm not saying Rust is incorrect here, I'm saying they have a fundamental difference in how they use storage and view types. Rust might have a regret here around their treatment of trailing slash, but it's not clear there's an obvious solution beyond greater granularity of control in their programming model.

"Rather than make convenient assumptions that vary by platform and may be invalidated by future updates to the underlying OS"

I do not see coalescing repeated separators as being a convenient assumption that may be invalidated by future updates by any tier-1 supported OS. I'm not even sure it's technically possible for Windows, Linux, or Darwin to deviate here even if they wanted to. Could you elaborate on your concern?

I don't think we want an init(verbatim:) that skips separator coalescing. Any normalization we skip at construction time we repeat at every read: component iteration, equality, hashing, etc. I'm not sure there is a useful level of storage below coalescing repeated separators.

We might want to take a step back and define what "canonical form" means for FilePath, as a concept lighter than full lexical normalization. One possible definition of canonical form is:

  • Repeated separators coalesced
  • Interior . components collapsed (. only appears at the start of a relative path)
  • Trailing separator preserved as-is
  • .. components are left untouched

Win32 non-literal paths would additionally:

  • Convert / to \
  • (potentially) lexically resolve .. or leave it for full lexical normalization
  • (potentially) handle devices, or leave them for full lexical normalization, or even full kernel resolution

FilePath would store in canonical form upon construction. Its component view could maintain this invariant or there could be an explicit canonicalize operation after mutations. Maintaining the invariant is really nice as it would allow for substitutability to be literal byte equality.
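The POSIX half of that definition can be sketched as follows, over a plain String. The function name is a placeholder, and dropping a final "." component (so "a/." becomes "a") is a choice made for this sketch that the definition above leaves open.

```swift
/// Sketch of the proposed POSIX canonical form:
/// - repeated separators coalesced
/// - "." kept only as the first component of a relative path
/// - trailing separator preserved as-is
/// - ".." left untouched
func canonicalForm(of path: String) -> String {
    let isAbsolute = path.first == "/"
    let parts = path.split(separator: "/")      // drops empty components
    let hadTrailing = path.last == "/" && !parts.isEmpty
    var kept: [Substring] = []
    for (index, part) in parts.enumerated() {
        // Skip "." unless it is the leading component of a relative path.
        if part == "." && (isAbsolute || index > 0) { continue }
        kept.append(part)
    }
    var result = (isAbsolute ? "/" : "") + kept.joined(separator: "/")
    if hadTrailing && result.last != "/" { result += "/" }
    return result
}

print(canonicalForm(of: "a///b/./c/"))
print(canonicalForm(of: "./a/../b"))
```

If every FilePath is stored this way, comparing two paths reduces to comparing their stored bytes.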

The downside of maintaining the invariant is some surprise in RRC's path algebra. path.components.append(".") would result in . if path were empty, otherwise result in path. Similarly, insert's behavior would depend on whether insertion is into first position of a relative path.

(Alternatively, we could ignore . as part of a canonicalization invariant)

If we go with this canonical invariant route, I think this starts to argue that FilePath's notion of "substitutability" is byte equality of paths in a canonical representation that preserves trailing slash. This makes it amenable to domain conventions as well as stat.

Treating trailing separators as relevant to substitutability is surprising, but not treating them as relevant to substitutability is also surprising. Establishing the concept of a canonical representation that preserves trailing slash at least gives us a consistent choice.

For example, it could be surprising that FilePath("/tmp/foo") != FilePath("/tmp/foo/"). That is, it could be surprising for them both to be separate Dictionary keys, but it would also be surprising if they weren't, as retrieval might preserve or strip trailing slashes. We're shuffling where in code developers need to care about this, but we're not able to fully define it away.

I worry that if we don't provide an initializer that takes the path verbatim and treats it as an opaque bag of bytes, then there will be use cases that immediately have to fall back to some other far less appropriate data type to attempt to preserve it, and that will be a foot-gun for developers.

The most obvious example that comes to mind is logging/diagnostic reporting. If I'm reading paths from an external source, then as I'm processing them, I may want to report something to the user about them. And even if I normalize the path before using it, I may want to do any user-facing reporting with the path exactly as they wrote it in the file to avoid confusion.

If a developer is unable to do that, they're going to be tempted to fall back to String (ruining their ability to handle non-UTF-8 paths, now a correctness issue) or some awkward array of bytes type that they need to decode anyway to present to the user. I'd rather let FilePath do that conversion for me, because it knows how to handle those edge cases better than I would.

If we start picking and choosing what kinds of normalization we automatically consider, I worry that we'll close off important use cases and make correct code harder to write. We've already acknowledged that trailing separators can be semantically important, so I think we really do need an option that acknowledges that the path as it was originally written is important as well, and if a user wants to canonicalize it, they have that explicit option.

7 Likes

It’s not only an assumption, it’s an incorrect assumption. Consecutive separators are syntactically meaningful on Windows: \\server\share is a UNC path; \\?\C:\ is a DOS device path.

I completely agree with @allevato: the API should encourage users to immediately express their intent to deal with user input as a path, rather than encourage them to traffic in strings until the moment of resolution.

1 Like

Don’t these repeated separators only occur at the beginning of paths (i.e. C:\dir\\dir\file is the same as C:\dir\dir\file)? That’s why POSIX and pathlib don’t conflate them at the start.

That’s true for now. But I don’t recall seeing a “start of path” exception to the coalescing normalization rule in this pitch. What other mistaken assumptions has it made? And what other correct assumptions might be invalidated in the future, whether by the existing supported platforms or by expanding to new ones? I suspect the whole reason Windows uses consecutive separators is because it served to retrofit a new semantic onto a previously invalid syntax.

At least with an explicit normalization step, the point of failure is known and contained.
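To illustrate what is at stake, here is a sketch contrasting a coalescing rule with no "start of path" exception against one that leaves the leading run of backslashes alone. Both function names are invented, only \ is treated as a separator, and neither is a claim about what the pitch actually specifies for Windows.

```swift
/// Coalesces every run of "\" into one, with no exception for the prefix.
func naiveCoalesce(_ path: String) -> String {
    var result = ""
    for character in path where !(character == "\\" && result.last == "\\") {
        result.append(character)
    }
    return result
}

/// Leaves the leading run of "\" verbatim and coalesces only after it.
func prefixAwareCoalesce(_ path: String) -> String {
    let prefix = path.prefix(while: { $0 == "\\" })
    return String(prefix) + naiveCoalesce(String(path.dropFirst(prefix.count)))
}

let unc = #"\\server\share\\dir"#
print(naiveCoalesce(unc))        // the UNC prefix is destroyed
print(prefixAwareCoalesce(unc))  // the UNC prefix survives
```

The second function only works because it encodes today's knowledge of where consecutive separators are meaningful, which is the assumption being questioned here.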

1 Like