Pitch: Add FilePath to the standard library

I thought prefix referred to fixed components like / or \\.\ that have a concrete, system-independent meaning. There is no such value for HFS-style paths; on a system with multiple hard disks, an HFS path on the external hard disk looks like External HD:Music:Logic Pro. It shares no elements in common with a path on Macintosh HD.

All the functions in Files.h take a “parent directory” argument, so working directories aren’t really a concept on classic Mac OS. If you try to create a relative path in the AppleScript Editor using POSIX file "foo.txt" as alias, it does produce an object that describes itself as alias ":foo.txt", but it causes an error. I don’t know if there exists any filesystem-related API that would accept such a string.

Correct, and that is why the comparison operations that @Michael_Ilseman has stated are important: they can ensure that roots are taken into account. What you are describing is precisely in line with what I stated in the previous reply. This is exactly the same shape as Windows, where each drive is effectively a separate volume: C:\Users\compnerd and D:\Users\compnerd do not have any elements in common. External HD and Macintosh HD are likewise two separate volumes, and each volume name is the prefix of its paths.

I thought MacOS Classic had a weakly formed idea of current directories, and applications did behave differently regarding it.

I hadn’t internalized that latest approach to “prefix,” and I’m currently not convinced it’s worth the complexity. Specifically on Windows, I don’t find the proposed definition of prefixes for \\.\ and \\?\ paths intuitive. The whole purpose of these path formats is to allow arbitrary access within the Win32 namespace (not the NT namespace, by the way; the Win32 namespace is the subset of the NT namespace rooted at \Global??). I would therefore expect the root of all these paths to be \\?\:

  • \\?\C:
  • \\?\SomeDriveThatHasNoLetter\foo.txt
  • \\?\SomeLinuxSMBServer\c:\linuxAllowsColons.txt
  • \\?\Volume{guid}\path\to\file.txt
  • \\?\Global\C: (TIL \Global??\Global is a symlink to \Global??)

I don’t think exposing any additional components in the prefix is a good idea, because it will encourage people to try to compare whether two paths have the same prefix as a proxy for some other semantic check, and this can backfire. For example, if I pass the path \\.\C:\..\D: to a Win32 API, I expect it to refer to the D: drive, but according to @Michael_Ilseman’s chart, FilePath.prefix would tell me that its prefix is \\.\C:. If I’m trying to restrict access to just C:, trying to use the prefix would be misleading.

That’s my takeaway from this discussion. We’re almost a month into this thread, and we haven’t yet reached the end of “on X platform, Y thing works a little differently” – even just on our Tier 1 platforms.

For the purposes of the stdlib, I think we would do well to make this as universal and predictable (in the sense of being boring) as possible. Substitutability is a property of both platform/FS and use-case (e.g., trailing slashes are almost never meaningful on UNIX, no matter what the spec says, and it’s more-or-less just an oddity of rsync that it treats them differently), so there’s no universally not-wrong way to define equality besides non-normalized byte equality†. The main utility of FilePath (IMO, at least) is three fairly basic things:

  1. It’s a distinct type from String, which has all the benefits of a newtype, etc.
  2. It handles Windows’ UTF-16 paths transparently
  3. It allows fairly easy iteration / composition / inspection

Numbers 1 and 2 are I think entirely uncontroversial. Number 3 is mostly straightforward under almost any model, though there might be some room to quibble about how straightforward “inspecting” roots of a path is, as we’ve seen recently in this thread, but an easy answer – to start with, at least – is to leave out that kind of platform-specific behavior (or, rather, leave it to libraries to figure out for now).

In my mind, all of this points to a bag-of-bytes storage model (and byte-based equality††), which is dead-simple conceptually. Platform-specific behavior can live on top of the basic, universal concept of a bag-of-bytes FilePath.

† I also think FilePath.== is probably not a super common need, and developers using that behavior would be in the best position to know what kind of normalization is appropriate for their use cases. We don’t have to start out with stdlib-provided normalizations – we could leave that as a supplement provided by System or third-party libraries until we have a good consensus solution

†† We could also leave off an implementation of == for now. That’s probably going a bit too far, but it has its merits
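
To make the bag-of-bytes idea concrete, here is a minimal sketch. `RawPath` is a hypothetical name, not the pitched type, and it assumes UTF-8 storage (a Windows variant would presumably store UInt16 code units instead). It only shows what non-normalized byte equality means in practice:

```swift
// Hypothetical type for illustration only; not the proposed FilePath.
struct RawPath: Hashable {
    // The path exactly as given, with no normalization applied.
    var bytes: [UInt8]

    init(_ string: String) {
        self.bytes = Array(string.utf8)
    }
}

// A trailing separator is just another byte, so these differ.
print(RawPath("build/") == RawPath("build"))        // false

// String compares by Unicode canonical equivalence; the bytes do not.
let precomposed = "caf\u{E9}"     // "é" as a single scalar
let decomposed = "cafe\u{301}"    // "e" followed by a combining acute accent
print(precomposed == decomposed)                    // true
print(RawPath(precomposed) == RawPath(decomposed))  // false
```

Anything smarter (case folding, Unicode normalization, trailing-separator handling) would then be an explicit layer on top, chosen by whoever knows the file system and the use case.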

We're getting into the weeds here, but paths are not canonical on classic Mac OS (MFS, HFS, or HFS+, take your pick). Two volumes can have the same name. The only way to get a canonical reference to a file on classic Mac OS is via a volume reference number and then a file reference number. The majority of Toolbox calls on Mac OS don't even accept a file path.

Realistically, I don't think we'd put a lot of effort into supporting non-POSIX, non-Windows paths as every modern system uses one or the other. CFURL historically had API for working with Macintosh paths as part of CarbonLib, and we could dust that off if we absolutely had to.

We still need to describe how a path is parsed if we are to expose components. For example, what are the components of the Win32 path \\server\share\foo\bar? Similarly, the XNU path /.nofollow/foo/bar? For that one, is it different than //.nofollow/foo/bar? (The XNU kernel certainly treats them differently). Even a simple POSIX path /foo//./bar/ yields different answers on C++17 and Rust. (FWIW, I think Rust is the better fit for Swift for what a component is).
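
To illustrate the /foo//./bar/ disagreement, here is a rough sketch of the two component models as I understand them. Both function names are hypothetical, neither is a proposed FilePath API, and both handle only POSIX-style paths:

```swift
let separator: Character = "/"

/// Roughly how C++17 std::filesystem iterates a path: the root, every
/// non-empty name (including "."), and an empty final element when the
/// path ends in a separator.
func cxx17StyleComponents(_ path: String) -> [String] {
    var result: [String] = []
    if path.hasPrefix("/") { result.append("/") }
    let names = path.split(separator: separator).map(String.init)
    result.append(contentsOf: names)
    if path.hasSuffix("/") && !names.isEmpty { result.append("") }
    return result
}

/// Roughly how Rust's Path::components behaves: repeated separators and a
/// trailing separator are dropped, and so is "." unless it is the very
/// first component of a relative path.
func rustStyleComponents(_ path: String) -> [String] {
    var result: [String] = []
    let isAbsolute = path.hasPrefix("/")
    if isAbsolute { result.append("/") }
    let names = path.split(separator: separator).map(String.init)
    for (index, name) in names.enumerated() {
        if name == "." && (isAbsolute || index > 0) { continue }
        result.append(name)
    }
    return result
}

print(cxx17StyleComponents("/foo//./bar/"))  // ["/", "foo", ".", "bar", ""]
print(rustStyleComponents("/foo//./bar/"))   // ["/", "foo", "bar"]
```

Either answer is defensible, which is why the parsing rules need to be written down before components are exposed.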

I'm working on a second draft right now (along with @compnerd), looking to home in on that proper subset: useful enough from day 1 to be compelling, without painting ourselves into a design corner.

The encoding of .. as a separate enum case (ParentDir) is interesting, but the lack of an explicit representation of separators or empty components seems to betray the claim of “no normalization.” Windows supports both \ and / as path separators, but / is also the typical flag prefix. This raises some interesting security considerations when handling paths like \\?\Path/With/Space /s, especially in script-like programs that might pass arguments to a shell.

Without explicitly representing component separators, the program is forced to assume a slash direction when recombining a path into a string. This is effectively a hidden normalization step. Since / is the more portable option, I fear many applications would just assume / and unwittingly expose themselves to security concerns like the above. In addition, Rust has already had to fix a bug where separators were ignored for hashing. I wonder if hello/world and hello\world hash identically in Rust today, and if that poses any concerns.

If the goal is to avoid painting Swift into a corner, why not start with the most absolute basic representation, where every single element of the input string is represented structurally, with explicit normalization steps provided as opt-in APIs?
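
As a rough sketch of what "every element represented structurally" could mean, assuming only / and \ count as separators: `PathElement`, `tokenize`, and `render` are hypothetical names made up for this example.

```swift
// Hypothetical lossless representation; not part of any pitched API.
enum PathElement: Equatable {
    case separator(Character)  // "/" or "\", kept exactly as written
    case name(String)          // any run of non-separator characters
}

func tokenize(_ path: String) -> [PathElement] {
    var elements: [PathElement] = []
    var current = ""
    for character in path {
        if character == "/" || character == "\\" {
            if !current.isEmpty {
                elements.append(.name(current))
                current = ""
            }
            // Every separator is recorded, so repeated ones survive too.
            elements.append(.separator(character))
        } else {
            current.append(character)
        }
    }
    if !current.isEmpty { elements.append(.name(current)) }
    return elements
}

func render(_ elements: [PathElement]) -> String {
    var result = ""
    for element in elements {
        switch element {
        case .separator(let character): result.append(character)
        case .name(let text): result += text
        }
    }
    return result
}

let original = #"\\?\Path/With/Space /s"#
print(render(tokenize(original)) == original)                 // true
print(tokenize("hello/world") == tokenize(#"hello\world"#))   // false
```

Because nothing is dropped, recombining never has to guess a slash direction, and any normalization (collapsing separators, unifying slash direction) becomes a visible, opt-in transformation of the element list.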

I could buy that even enumerating path components is far enough out of the "core" API for FilePath that it doesn't need to be provided by the standard library. I'm sympathetic to @arennow's "bag of bytes (or WCHARs)" point of view; most of what I (admittedly not a power-user of paths) want to do with a path is take a precomposed path from one place and feed it to another API somewhere else, letting the underlying OS interpret it according to the platform's conventions. Maybe occasionally resolve a relative path against a known base (again following the platform semantics for doing so, maybe with nofollow/no-resolve-beneath limitations)—though as @fclout noted, if we encourage openat-style APIs that accept relative paths only relative to a prevalidated base directory, that seems like it could greatly reduce the need even to concatenate or resolve relative paths as paths.

(To be clear, I'm not arguing against more detailed path manipulation APIs existing at all. But I wonder if they might work better if they were explicitly nonportable so they could be tailored to each host platform's behavior and quirks.)

I think this is extremely common. It’s necessary for any command line tool that takes a filename as an argument, or any program that wants to write to a temporary directory. It would strike me as odd to have to import System to do these operations. At that point, it seems like the whole type should just live in System.

For what it’s worth, that’s a stone’s throw away from use cases where O_RESOLVE_BENEATH (and thus openat) is a security requirement. A command-line tool that takes a directory on the command line to (e.g.) unpack an archive really must be careful not to write outside of that directory, even if the archive says it has a file called ../../../etc/passwd, and that’s exactly what O_RESOLVE_BENEATH ensures. If the hypothetical command-line tool takes a directory path and a file name from the user, it’s fine to concatenate them by hand, as presumably the user is in control of both arguments already. On the other hand, openat also works, and it creates a habit that works in sensitive scenarios too.

though as @fclout noted, if we encourage openat-style APIs that accept relative paths only relative to a prevalidated base directory, that seems like it could greatly reduce the need even to concatenate or resolve relative paths as paths

The openat-style APIs are great for small little programs and specialised situations, but they come at a huge cost: Gotta have those directory file descriptors open (and manage them). It's just not always practical unfortunately.

Besides, macOS in its standard configuration allows only 256 open file descriptors per process by default, so if you do any sort of parallel traversal, you'll struggle hard. Even going 256 directory levels deep (not that outlandish) in a single thread would already be too much if you always wanted to use pre-resolved directories.

It sounds like what you have in mind for an openat-based design is that drilling into any directory requires one openat per component? I don't think that this is useful or works at all, for instance because sometimes you need paths to directories that don't exist yet. Rather, the one reason I like openat is that it frees engineers from difficult validation. You get to do less validation of paths as a bag of bytes on your end.

If you glob on a directory FD, I am generally not against that returning paths relative to the directory FD instead of FDs; you can then open those paths on your own terms, using openat on the directory FD. The point is to explicitly separate the part of the path that you fully trust to already exist and expect to have control over from the part that maybe you need a safety net for. For instance, if a rogue program replaces a file in the file list returned by glob with a symlink, openat (with O_RESOLVE_BENEATH) limits the damage to opening another file in the same tree.

  • path.prefix == "\\\\server\\share\\"
    path.components == [ "foo", "bar" ]
    path.hasTrailingSeparator == false
    

Yes, since repeated separators are only removed in the middle of paths. Otherwise, the NT global paths and shares you mentioned break as well. The first path has prefix /.nofollow/, the second one /.

I think so too. The repeated separator has no meaning on any target system (except maybe custom FS drivers on Windows?), and is normalized out.

I just found this terrifying note on Microsoft Learn:

Because it turns off automatic expansion of the path string, the "\\?\" prefix also allows the use of ".." and "." in the path names, which can be useful if you are attempting to perform operations on a file with these otherwise reserved relative path specifiers as part of the fully qualified path.

In other words, in a path that begins with \\?\, FilePath almost certainly should not convert . and .. components into .currentDirectory/.parentDirectory cases, and it must never drop them when normalizing the path.
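
A minimal sketch of that rule, under some loud assumptions: the `Component` enum and `components(ofWindowsPath:)` are hypothetical (only the `.currentDirectory`/`.parentDirectory` case names mirror the ones above), only \ is treated as a separator, and drive letters are treated as ordinary names for brevity.

```swift
// Hypothetical component model for illustration; not the pitched API.
enum Component: Equatable {
    case currentDirectory
    case parentDirectory
    case regular(String)
}

func components(ofWindowsPath path: String) -> [Component] {
    let verbatimPrefix = #"\\?\"#
    let isVerbatim = path.hasPrefix(verbatimPrefix)
    let body = isVerbatim ? String(path.dropFirst(verbatimPrefix.count)) : path
    let backslash: Character = "\\"

    return body.split(separator: backslash).map { name -> Component in
        // After \\?\ the string is passed through unparsed, so "." and ".."
        // are ordinary names and must be preserved as such.
        if isVerbatim { return .regular(String(name)) }
        switch name {
        case ".": return .currentDirectory
        case "..": return .parentDirectory
        default: return .regular(String(name))
        }
    }
}

print(components(ofWindowsPath: #"C:\foo\..\bar"#))
// [.regular("C:"), .regular("foo"), .parentDirectory, .regular("bar")]
print(components(ofWindowsPath: #"\\?\C:\foo\..\bar"#))
// [.regular("C:"), .regular("foo"), .regular(".."), .regular("bar")]
```

The point is only that the classification of a component depends on the path's prefix, so a parser cannot decide what ".." means by looking at the component alone.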

Also it’s fun when the docs conflict with other docs. This doc says:

For file I/O, the "\\?\" prefix to a path string tells the Windows APIs to disable all string parsing and to send the string that follows it straight to the file system.

[…]

The "\\.\" prefix will access the Win32 device namespace instead of the Win32 file namespace.

Meanwhile, the .NET companion to this doc says:

Skipping normalization and max path checks is the only difference between the two device path syntaxes; they are otherwise identical.

Those are valid file/folder names, as are device names (even in terminal position on Windows 11), etc.

From earlier in the thread: