Pitch: Add FilePath to the standard library

yes i’m familiar with that, i just don’t find it persuasive. it doesn’t make systems safer, it just nukes Swift’s utility as a scripting language and pushes people towards Bash and/or Python instead.

Thread tl;dr — what’s the proposed mechanism for pulling FilePath into the stdlib? If it requires some explicit action by clients, then we could indeed try to take on the Codable format transition, though it would have to be well-documented. Clients that need to maintain compatibility will have to continue to encode/decode System.FilePath instead and convert to/from Swift.FilePath.

If it’s something more automatic, then changing the Codable implementation out from under clients is just too risky. I’ve postulated that clients could opt in to the new implementation with a property wrapper or macro, e.g.:

@FilePathCodableFormat(.v2) var path: FilePath

I’m not aware of any clear precedent for this though.

It is also feasible that a CommonCodable and/or JSONCodable, etc. conformance could use an alternative format, since clients must opt in to using one of the new encoders/decoders.

FilePath.Root() as a no-argument initializer makes sense on Linux today, where there's exactly one root. It wouldn't be available on Windows as it's not meaningful there. On Darwin I'm not 100% sure, given the trend in XNU towards the resolution assertions discussed earlier. Note that we have some ergonomic affordances in FilePath.Root's ExpressibleByStringLiteral conformance (you can use "/"), so I'm (weakly) inclined to leave it out. But I'd appreciate more feedback on this point.

Arguably, the root directory of the boot volume (usually but not always "C:\") is the right string to use on Windows, but computing it is oddly complicated. I would mark rootPath as just unavailable on Windows.

Is it though? What if your current drive is something else? This is particularly important as \ is a (drive) relative path. This shows up in say cmd:

> S:
> cd \
> cd
S:\

Because each drive has its own root, and each drive has its own current directory, it's confusing to have a single notion of "root". Why is the root not the root of the current drive?

If we agree there, what happens when we are not on a drive? Say my current location is \\?\Volume{903d46fb-3995-440a-bdc7-b67604b9b03a}\ ... well, is that the root? Is the object manager the root?

I think that simply not having FilePath.Root is the right answer.

You're agreeing with me. :grimacing: (We already had this debate privately!)

Edit: We may want to have a driveLetter: Character? property or similar on Windows for scenarios where you're decomposing the path. I don't know what you'd decompose \\?\... into, but I'm sure we could consult Microsoft's documentation for ideas/advice.

I was under the impression that URL does a good job of representing file paths and is part of the standard library.
I have been using it in Linux/Windows code bases.
It also fits perfectly with FileManager, which lets you copy/move/delete etc. using URLs.

1 Like

URL is not part of the standard library, it's part of Foundation (as is FileManager). :slightly_smiling_face:

And the proposal addresses it. They want to use FilePath to write the standard library.

It really doesn't. Even simple operations, like getting the file path as a String back out of the URL, have to explicitly deal with quirks of URLs like percent encodings, and the simple .path property was deprecated since it's easy to forget to do that and get a botched result.
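To make the percent-encoding point concrete, here is a minimal sketch using Foundation's URL. The file name is made up for illustration. Newer Foundation releases also offer path(percentEncoded:), but its availability depends on platform and version, so it is left out here.

```swift
import Foundation

// A file path containing a space, wrapped in a file URL.
let url = URL(fileURLWithPath: "/tmp/my file.txt")

// The URL's string form is percent-encoded.
print(url.absoluteString)   // file:///tmp/my%20file.txt

// A naive "strip the scheme" conversion leaves the encoding in place,
// which is not a usable file system path.
let naive = String(url.absoluteString.dropFirst("file://".count))
print(naive)                // /tmp/my%20file.txt

// Getting the real path back means remembering to undo the encoding.
let decoded = naive.removingPercentEncoding
print(decoded ?? "<invalid percent encoding>")

// The legacy .path property does the decoding for you.
print(url.path)             // /tmp/my file.txt
```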

I think this could be a key detail. If we canonicalize away redundant interior dots, the way Rust's component iterator does, then we no longer need a concept of "normal" or "lexically-normal", merely a concept of "resolved" (whether real or lexical) as the only remaining concepts are .. on Darwin/Linux and, separately, Win32 device handling (which is dependent on Windows version and not known at compile time).


Canonicalization

FilePath normalizes encoding redundancy upon construction and mutation such that its contents are always in canonical form. Trailing slash is preserved. Equality is based on literal byte equality of the FilePath's contents, which are always in canonical form. This means trailing slash is relevant to equatability/substitutability.

Repeated separators are coalesced: a///b becomes a/b. Interior and trailing . components are dropped, but the preceding separator is preserved as a trailing separator, so a/./b/. becomes a/b/ and consists of two components: [a, b]. A leading . on a rootless path is kept, since ./foo is meaningfully distinct from foo at the application layer (shell $PATH lookup, execvp). On non-verbatim Win32, / is converted to \, since these are two spellings of the same separator. Trailing separators are preserved, as discussed upthread.

Win32 verbatim paths (\\?\) exist because the Win32 layer performs mandatory canonicalization between userspace and the kernel: separator conversion, ./.. resolution, trailing dot/space stripping, device name substitution, and MAX_PATH enforcement. \\?\ tells the Win32 layer to skip all of that. The way to think about verbatim is that it widens the component namespace: . and .. have no special meaning and are treated as regular names, / and : are legal component characters, and device names like CON are just names. The only characters forbidden in a verbatim component are \ (still the separator) and NUL. Our separator coalescing still applies since that is separator handling, not component handling. Other platforms don't need verbatim because there is no mandatory canonicalization layer between userspace and the kernel; the bytes you pass to open(2) are what the kernel sees. If a platform adds such a layer, we add verbatim then. This also means that, for example, .. in a verbatim path has Component.Kind .regular, not .parentDirectory.
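As a rough illustration of how verbatim widens the component namespace, the sketch below classifies a single component. ComponentKind and kind(of:inVerbatimPath:) are hypothetical stand-ins written for this example, not the pitched API.

```swift
// Hypothetical stand-in for the Component.Kind mentioned above.
enum ComponentKind: Equatable {
    case currentDirectory   // "."
    case parentDirectory    // ".."
    case regular            // any other name
}

// Classifies one path component. In a verbatim path, "." and ".."
// are ordinary names, so everything is .regular.
func kind(of component: String, inVerbatimPath isVerbatim: Bool) -> ComponentKind {
    if isVerbatim { return .regular }
    switch component {
    case ".":  return .currentDirectory
    case "..": return .parentDirectory
    default:   return .regular
    }
}

// The part of a verbatim path after its prefix. Only "\" separates
// components here; "/" is a legal component character.
let verbatimTail = #"foo\..\a/b"#
let names = verbatimTail.split(separator: "\\").map(String.init)
print(names)   // ["foo", "..", "a/b"]
print(names.map { kind(of: $0, inVerbatimPath: true) })
```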

// Separator coalescing
FilePath("a///b")                // "a/b"

// Interior dot dropped
FilePath("a/./b")                // "a/b"
FilePath("/usr/./local/./bin")   // "/usr/local/bin"

// Trailing dot dropped, preceding separator preserved as trailing slash
FilePath("a/b/.")               // "a/b/"
FilePath("a/b/./")              // "a/b/"   — same result either way
FilePath("a/.")                 // "a/"

// Leading dot on rootless path: preserved
FilePath(".")                   // "."
FilePath("./")                  // "./"
FilePath("./foo")               // "./foo"

// Dot after root: dropped
FilePath("/.")                  // "/"
FilePath("/./foo")              // "/foo"

// .. is untouched
FilePath("..")                  // ".."
FilePath("a/b/../c")           // "a/b/../c"

// Trailing separator preserved
FilePath("/tmp/foo/")          // "/tmp/foo/"
FilePath("/tmp/foo/") != FilePath("/tmp/foo")   // true

// Empty path
FilePath("")                   // ""  — distinct from "." and "/"

// Combined
FilePath("a/.///./b/.")        // "a/b/"  — two components: [a, b]

// Win32: / converted to \
FilePath(#"C:/foo/bar"#)       // #"C:\foo\bar"#
FilePath(#"C:."#)              // #"C:"#

// Win32 verbatim: . and .. are regular names
FilePath(#"\\?\C:\foo\.\bar"#)   // #"\\?\C:\foo\.\bar"#  — three components
FilePath(#"\\?\C:\foo\..\bar"#)  // #"\\?\C:\foo\..\bar"#  — three components

Just to highlight this point, under the above, trailing / affects equality and the following are distinct paths: "", ".", "./".
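For anyone who wants to experiment with these rules, here is a minimal sketch of the POSIX-style subset as a free function over String. The name canonicalize(_:) is hypothetical, the function ignores Win32 prefixes and verbatim paths entirely, and it collapses a leading // into /, which is an assumption of this sketch rather than something settled in the thread.

```swift
// POSIX-only sketch of the canonicalization rules listed above.
func canonicalize(_ path: String) -> String {
    if path.isEmpty { return "" }          // the empty path stays empty
    let isRooted = path.hasPrefix("/")

    // Omitting empty subsequences coalesces repeated separators.
    let raw = path.split(separator: "/", omittingEmptySubsequences: true)

    var components: [Substring] = []
    for (index, component) in raw.enumerated() {
        if component == "." {
            // Keep only a leading "." on a rootless path; drop the rest.
            if index == 0 && !isRooted { components.append(component) }
            continue
        }
        components.append(component)       // ".." is left untouched
    }

    // A dropped trailing "." leaves its preceding separator behind.
    let droppedTrailingDot = raw.last == "." && !(raw.count == 1 && !isRooted)
    let wantsTrailingSeparator = path.hasSuffix("/") || droppedTrailingDot

    var result = isRooted ? "/" : ""
    result += components.joined(separator: "/")
    if wantsTrailingSeparator && !components.isEmpty { result += "/" }
    return result
}

print(canonicalize("a/.///./b/."))   // a/b/
print(canonicalize("./foo"))         // ./foo
print(canonicalize("/./foo"))        // /foo
```

Because the output is always in canonical form, comparing two results with == is plain string equality, which is the substitutability model described above.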

Question for my learning—yes, I think your analysis shows we could do this coalescing mandatorily and the rules for that can be workable, but what use cases would it enable that would be difficult or impossible otherwise? Are there any platforms for which non-coalesced no-op internal . components cause mischief?

The reason I ask is because, with different rules for leading dot, rational but not exactly immediately obvious behavior for trailing dot-slash, different behavior for .., and a separate codepath for Win32 "verbatim" mode, it is a whole lot of explaining which...maybe we don't have to do for most or all ergonomic uses?

Any path type has storage, substitutability, canonicalization, and resolution. Maintaining canonical form as a type invariant leaves resolution as the only explicit concept. This gives us a model that's simpler to explain and reason about, and that gives users the ergonomics they can reasonably expect.

Right now, this pitch has a flawed concept of "normal". It conflates representational-normalization (which I'm now calling "canonicalization") with path resolution (which may or may not be lexical). FilePath as pitched canonicalizes separators, including coalescing repeated separators, but not interior . components. It also drops a trailing slash, which we've established as wrong.

Supporting non-canonical storage requires an API to put a file path into some canonical form. For example, this may be spelled normalize(dropTrailingSeparators: false, lexicallyResolveDotDot: false) or just canonicalize(). It also requires that we define substitutability either over canonical form (which means we're not doing memcmp-style comparisons) or over whatever happens to be stored (in which case we aggressively educate developers to always canonicalize before use or comparison).

If FilePath stores and maintains canonical form as a type invariant, it eliminates an explicit layer of API and corresponding concerns for the developer. There is no isLexicallyNormal, no normalize(), no question of what "normal" normally means and whether it conflates resolution. Equality is byte equality. Component iteration is clean and naturally follows from what you see when you print the path out. The rules around canonicalization have nuance, but that nuance lives in the type's implementation and documentation rather than in user code.
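The trade-off between the two storage models can be sketched with two toy types. RawPath, CanonicalPath, and coalesceSeparators are hypothetical names for this example, and the normalization step is deliberately reduced to separator coalescing, which is only a subset of the rules under discussion.

```swift
// Stand-in normalization: coalesces repeated "/" and nothing else.
func coalesceSeparators(_ s: String) -> String {
    var out = ""
    var previousWasSeparator = false
    for ch in s {
        if ch == "/" {
            if !previousWasSeparator { out.append(ch) }
            previousWasSeparator = true
        } else {
            out.append(ch)
            previousWasSeparator = false
        }
    }
    return out
}

// Model 1: stores whatever it was given. Callers must remember to
// canonicalize before comparing.
struct RawPath: Equatable {
    var storage: String
    init(_ s: String) { storage = s }
    func canonicalized() -> RawPath { RawPath(coalesceSeparators(storage)) }
}

// Model 2: canonical form is a type invariant, re-established on
// construction and on every mutation. Synthesized == is byte equality.
struct CanonicalPath: Hashable {
    private(set) var storage: String
    init(_ s: String) { storage = coalesceSeparators(s) }
    mutating func append(_ component: String) {
        storage = coalesceSeparators(storage + "/" + component)
    }
}

let rawA = RawPath("a//b")
let rawB = RawPath("a/b")
print(rawA == rawB)                                    // false
print(rawA.canonicalized() == rawB.canonicalized())    // true
print(CanonicalPath("a//b") == CanonicalPath("a/b"))   // true

var p = CanonicalPath("a/")
p.append("b")
print(p.storage)                                       // a/b
```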

What about the concept of a path prefix? What would be the components of that path?

A prefix concept might be a better fit for XNU resolve flags too and it might let us specially handle leading ..

Something like:

Prefix kinds

| Platform | Prefix | Meaning |
|---|---|---|
| All | (none) | Implicitly cwd-relative |
| All | ./ | Explicitly cwd-relative |
| POSIX | / | Filesystem root |
| Darwin | /.resolve/N/ | Filesystem root + resolve flags |
| Windows | C:\ | Drive-absolute (absolute) |
| Windows | C: | CWD on drive C (relative) |
| Windows | \ | Current-drive root (relative) |
| Windows | \\server\share\ | UNC |
| Windows | \\?\... | Verbatim (NT namespace) |
| Windows | \\.\device\ | Device |
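As a rough sketch of how the Windows rows could be told apart lexically, the function below classifies a path string by its leading characters. WindowsPrefixKind and windowsPrefixKind(of:) are hypothetical names, and the sketch assumes / has already been converted to \ for non-verbatim input.

```swift
// Hypothetical classification of the Windows prefix kinds listed above.
enum WindowsPrefixKind: Equatable {
    case none                       // implicitly cwd-relative
    case currentDirectory           // explicitly cwd-relative
    case driveAbsolute(Character)   // C:\
    case driveRelative(Character)   // C:
    case currentDriveRoot           // \
    case unc                        // \\server\share\
    case verbatim                   // \\?\
    case device                     // \\.\
}

func windowsPrefixKind(of path: String) -> WindowsPrefixKind {
    // Check the longest, most specific prefixes first.
    if path.hasPrefix(#"\\?\"#) { return .verbatim }
    if path.hasPrefix(#"\\.\"#) { return .device }
    if path.hasPrefix(#"\\"#) { return .unc }
    if path.hasPrefix(#"\"#) { return .currentDriveRoot }

    // Drive letter followed by ":" and, optionally, a separator.
    let head = Array(path.prefix(3))
    if head.count >= 2, head[0].isASCII, head[0].isLetter, head[1] == ":" {
        if head.count == 3, head[2] == "\\" { return .driveAbsolute(head[0]) }
        return .driveRelative(head[0])
    }

    if path == "." || path.hasPrefix(#".\"#) { return .currentDirectory }
    return .none
}

print(windowsPrefixKind(of: #"C:\foo"#))            // driveAbsolute("C")
print(windowsPrefixKind(of: #"C:foo"#))             // driveRelative("C")
print(windowsPrefixKind(of: #"\\server\share\x"#))  // unc
```

The drive-letter payload on the two drive cases is one possible home for the driveLetter idea raised earlier in the thread.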

Would it make sense to limit the standard library FilePath API to concepts that universally make sense, and push OS-dependent concepts like a root/prefix to System or another more expressly platform-specific package? That could give us more flexibility to tailor the API on each supported platform to match the semantics of each platform, without having to water them down into something that vaguely fits all of them.

What about the concept of a path prefix? What would be the components of that path?

I'm not sure I understand. Is this not just what the FilePath.Root type is? Is the idea to rename FilePath.Root to FilePath.Prefix? You are correct that the current model does work well for the "prefix" use. If nothing else, I think the rename has the benefit of making it clearer that Root is not the file system root.

My two cents on canonicalization as someone who did a lot of research on this many years ago. A lot of times, paths are pass-through values. I get a path from some (C) API, eventually, I pass it to some other (C) API. What's in the path is not interesting to me at all. Not messing with the bag of bytes is perfectly fine, and fast.

On the other hand, depending on how you canonicalize, you have to deal with character encoding, the semantics of the path, etc. You could (hypothetically) encounter some weird operating system in the future whose APIs think a bag of bytes is valid while your canonicalization thinks it's not.

Certainly no expert in this area, but I can see a lot of "uses" that are as @duan describes, without any need to perform operations on the path.

And as to substitutability, I can imagine applications where bag-of-bytes substitutability is sufficient, and of course plenty of applications which really require actual filesystem-based resolution (ideally without creating TOCTOU bugs)—i.e., do these two paths boil down to the identical resource?

But I'm not sure I understand the need for an idiosyncratic-to-Swift lexically-normalized-while-preserving-dot-dot notion of substitutability, which as you say oftentimes on many platforms is something close to but distinct from actual resolution. Should we be vending such a notion of substitutability at all? And even if so, shouldn't it be an explicitly requested operation that isn't a default == that can be invoked generically?

I understand there is this concept of “Tier 1” platforms, but I am curious about how this design will behave on platforms where file paths are substantially different from POSIX or Windows.

For example, on classic Mac OS (and even modern macOS when using old FSRef APIs), Macintosh HD:System:Control Panels is a fully qualified path. There is no root prefix to indicate this.

I'd argue that is not the case. In fact, I think that this was the insight that @Michael_Ilseman pointed out above: FilePath.Root is a misnomer and that FilePath.Prefix might be more appropriate.

In this particular case, Macintosh HD IMO would be the "prefix" - it roots the path at the volume (effectively the same as the drive letter on Windows) and makes the path absolute. If there is no prefix (which is the first entry in the table) - the path is relative, which holds here as well: :System:Control Panels would be a relative path relative to the current working directory.