SE-0529: Add FilePath to the Standard Library

Thanks for sharing this. It's a great article and really shows that we need higher level FileSystem APIs that make it safe to interact with the file system instead of operating on file paths all the time. File paths are still an important currency type to add but I think it's crucial to separate system interactions from such a currency type.

Have we strayed a bit off topic?

I think it's sufficient for this proposal to introduce a basic, concrete "file path" type, upon which we can build FileSystem and other enhancements later.

Questions about foreign file paths and blocking I/O don't seem germane to me (though I fully support the community discussing them! … just not in this review thread. :grimacing:)

1 Like

It’s directly relevant to the question of whether FilePath should offer APIs that attempt to interpret the semantics of the path.

My spidey senses were particularly set off by this header comment:

    /// On Windows, returns the drive letter for anchors of the form
    /// `C:\`, `C:`, `\\?\C:\`, or `\\.\C:\`. Returns `nil` for UNC
    /// anchors, non-drive device anchors, and the current-drive root `\`.
    public var driveLetter: Character? { get }

Confusion about how path names map to actual devices has been exploited several times by malware authors, to the point where Project Zero has multiple articles on the topic:

These articles are a decade old, so they might even be out of date. That is also true of any semantics that we bake into FilePath.

1 Like

I think we can reasonably land the container type first with the uncontentious interfaces, and then iterate on the bits that aren't as straightforward.

3 Likes

Hi all, I have some feedback from the LSG on things to further refine through discussion here with regard to FilePath as proposed. I will, separately, respond to the recent discussion later if applicable.

First, this PR contains the proposal with the update I mentioned earlier.

Reconsider separator: Character and driveLetter: Character

The LSG noted it's odd to introduce FilePath.CodeUnit as a platform-specific code unit type and then not use it consistently for things assumed to be individual code units. Since the two API have separate concerns, I think it makes sense to tackle them separately, though some of the rationale may carry over.

My rationale for choosing Character for FilePath.Anchor.driveLetter is that it is more towards the printing or comparison-against-literal end of the spectrum and less towards the custom-parser end. A developer might want to take the drive letter, capitalize it, and then compare against a literal. These would not work with UInt16 without jumping through a lot of extra hoops. Something like a parse result that has the position of the drive letter, possibly generic over ~Escapable storage, was mentioned as future work.

Under this rationale, assuming we keep driveLetter (see below for why we may or may not want to) and keep Character as the type, it might make sense for the API to always capitalize and document it as such on behalf of the user.
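To make the trade-off concrete, here is a minimal sketch of the "capitalize, then compare against a literal" case. It uses plain Character and UInt16 values as stand-ins for whatever driveLetter would return. FilePath itself is not used, and both helper functions are hypothetical names introduced only for illustration.

```swift
// Stand-ins for a drive letter as `Character` and as a UTF-16 code unit.
// These helpers are illustrative only; they are not part of the proposal.

/// Case-insensitive comparison against a literal when the letter is a `Character`.
func isDriveC(_ letter: Character?) -> Bool {
  // `uppercased()` returns a `String`, which compares directly to a literal.
  letter?.uppercased() == "C"
}

/// The same check when the letter is a raw UTF-16 code unit.
func isDriveC(codeUnit: UInt16?) -> Bool {
  guard var unit = codeUnit else { return false }
  // Manual ASCII upper-casing: 'a'...'z' is 0x61...0x7A.
  if (0x61...0x7A).contains(unit) { unit -= 0x20 }
  // There is no `UInt16(ascii:)`, so widen from `UInt8(ascii:)`.
  return unit == UInt16(UInt8(ascii: "C"))
}
```

If the API always capitalized on the user's behalf, the Character version would shrink to a plain `letter == "C"`.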

However, this rationale does not apply to FilePath.separator which is not a failable instance member like driveLetter, but rather is a constant static member denoting the platform's canonical separator. This API is far less likely to be compared against a literal and is much more likely to be used in lower-level parsing or byte-splicing code (i.e. along with FP.codeUnits and FP.init?(codeUnits:)). As a platform-level constant, I think this should be changed to FilePath.CodeUnit.
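As a rough illustration of the byte-splicing use case, the sketch below splits raw path bytes on a separator that is already a code unit. It assumes a POSIX-like platform where the code unit is UInt8. PosixCodeUnit and splitOnSeparator are hypothetical names, and the proposed FilePath.separator is not used.

```swift
// Assumption: a POSIX-like platform where the path code unit is `UInt8`.
typealias PosixCodeUnit = UInt8

/// Splits raw path bytes on the separator, dropping empty pieces.
/// The default separator is the ASCII "/" code unit.
func splitOnSeparator(
  _ units: [PosixCodeUnit],
  separator: PosixCodeUnit = UInt8(ascii: "/")
) -> [[PosixCodeUnit]] {
  units.split(separator: separator).map { Array($0) }
}

let pieces = splitOnSeparator(Array("/usr/local/bin".utf8))
  .map { String(decoding: $0, as: UTF8.self) }
print(pieces)  // ["usr", "local", "bin"]
```

With a Character-typed separator, the caller would first have to convert it to a code unit before any comparison like this could happen.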

(I will note that both of the API are severable.)

Platform-specific API direction and driveLetter

The LSG wanted to see some more discussion of whether some API, such as driveLetter, should be conditionally available based on platform vs universally available. A potential future direction is sketched here, though there are many ways to spell/slice things.

On one end of the spectrum, the proposed driveLetter and isVerbatimComponent have degenerate fall-through values for platforms that don't support them: nil and false. On the other end, Component.init?(verbatim:) is truly meaningless outside of Windows and so is platform-conditional (similarly, resource fork API on Darwin).

I think that with the sketched future directions, it might make more sense for isVerbatimComponent to be Windows-only. The same goes for driveLetter, since any code in the non-nil branch is inherently Windows-specific.
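The two ends of that spectrum can be sketched with conditional compilation. SketchAnchor below is a hypothetical stand-in that only stores text, and its parsing is deliberately simplified (it ignores the `\\?\` and `\\.\` forms), so this shows the two API shapes rather than the proposed implementation.

```swift
/// Hypothetical stand-in for an anchor; it just stores the anchor text.
struct SketchAnchor {
  var text: String

  /// Simplified parse: an ASCII letter followed by ":".
  private var parsedDriveLetter: Character? {
    let chars = Array(text.prefix(2))
    guard chars.count == 2, chars[0].isASCII, chars[0].isLetter,
          chars[1] == ":" else { return nil }
    return chars[0]
  }

  /// Style 1: available everywhere, with a degenerate value off Windows.
  var driveLetterEverywhere: Character? {
    #if os(Windows)
    return parsedDriveLetter
    #else
    return nil
    #endif
  }
}

#if os(Windows)
extension SketchAnchor {
  /// Style 2: the member exists only when compiling for Windows.
  var driveLetter: Character? { parsedDriveLetter }
}
#endif
```

With style 2, code that touches driveLetter fails to compile on other platforms unless the call site is itself wrapped in `#if os(Windows)`, which makes the platform dependence visible where it is used.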

Separately, I think it makes sense to have a different library that provides foreign path support, something akin to:

    // Available on all platforms.
    public struct Win32Path { ... }
    public struct XNUPath  { ... }
    public struct LinuxPath { ... }

    extension FilePath {
      // Failable: returns nil if the FilePath isn't representable as the foreign type.
      public var asWin32Path: Win32Path? { get }
      public var asXNUPath:   XNUPath?   { get }
      public var asLinuxPath: LinuxPath? { get }
    }

    extension Win32Path {
      // Failable: nil if the foreign path isn't representable on the host.
      public init?(_ filePath: FilePath)
    }

Closure-based access to NUL-terminated code units

The LSG wants the count passed to the closure and to consider other names than withCString.

Candidate names:

  • withNullTerminatedCodeUnits(_:). Explicit in the name about the NUL guarantee, matches the var codeUnits / nullTerminatedCodeUnits symmetry.
  • withCodeUnits(_:). Documents the NUL guarantee separately rather than in the name.

I lean toward withCodeUnits(_:) as the count does not include the NUL byte.
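For readers following along, here is a minimal sketch of what "the count does not include the NUL byte" could look like. SketchPath is a hypothetical stand-in backed by a byte array, not the proposed FilePath, and the closure signature is an assumption about one possible shape.

```swift
/// Hypothetical stand-in for a path type with NUL-terminated storage.
struct SketchPath {
  /// Invariant: always ends with a single NUL terminator.
  private var storage: [UInt8]

  init(_ text: String) { storage = Array(text.utf8) + [0] }

  /// Passes a buffer whose `count` excludes the terminator.
  func withCodeUnits<R>(
    _ body: (UnsafeBufferPointer<UInt8>) throws -> R
  ) rethrows -> R {
    try storage.withUnsafeBufferPointer { full in
      // Drop the terminator from the view; it still sits at `base[count]`.
      try body(UnsafeBufferPointer(rebasing: full.dropLast()))
    }
  }
}

let (count, terminator) = SketchPath("/tmp/x").withCodeUnits { units in
  (units.count, units.baseAddress![units.count])
}
print(count, terminator)  // 6 0
```

The buffer can be handed to a C API that expects a NUL-terminated string, while Swift code iterating over it never sees the terminator.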

Separately, there's some interest in keeping var nullTerminatedCodeUnits: Span<FilePath.CodeUnit> alongside the closure form. I'm going to (weakly) argue against keeping it, deferring it until there's a clearer future Span-to-C bridging story, but either decision is defensible.

1 Like

For the code unit issue, what about making it a RawRepresentable struct wrapping an integer? Then you could add either Character(codeUnit) or codeUnit.character/.string to get at the type as a string. It would mean that collections of these types would not directly bridge to C, but you can still safely bitCast them to the underlying type.
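A quick sketch of that idea, assuming a UTF-16 code unit as on Windows. WrappedCodeUnit and its character property are hypothetical names, not anything in the proposal.

```swift
/// Hypothetical wrapper around a UTF-16 code unit.
struct WrappedCodeUnit: RawRepresentable, Equatable {
  var rawValue: UInt16
  init(rawValue: UInt16) { self.rawValue = rawValue }

  /// `nil` for lone surrogates, which are not Unicode scalars.
  var character: Character? {
    Unicode.Scalar(rawValue).map { Character($0) }
  }
}

// The wrapper adds no storage beyond the integer it wraps.
print(MemoryLayout<WrappedCodeUnit>.size == MemoryLayout<UInt16>.size)  // true
```

Because the layout matches the underlying integer, reinterpreting a buffer of these as UInt16 is plausible, though it would still go through an unsafe cast rather than bridging directly.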

FWIW, it does feel odd to reify drive letters in the API -- a peculiarity only(?) Windows has, and not instead introduce something that feels a little bit more generic like anchorComponent or something that might be a drive letter on Windows but / on SUS systems. I'm handwaving, but still.

Core rationale: Parsing paths correctly seems easy but is surprisingly hard and bug prone in the general case. Even on Darwin, it's an extremely small set of developers that understand how paths really parse. I have recently become a member of that small set; it's not something I'd want other developers to have to learn just to do very simple things like pop components off of a path. This is somewhat analogous to Unicode correctness, which seems easy at times but in practice is not something you want to have to reinvent.

On Linux, it's fairly easy. If you keep popping components off of /foo/bar you end up at / and then the operation becomes idempotent. On Darwin, if you keep popping components off of /.nofollow/foo/bar the fixed point is /.nofollow/, not /. Similarly, /.vol/123/456/foo/bar would be /.vol/123/456 (note: no slash at the end). Furthermore, /.resolve/1/ is a syntactic alias for /.nofollow/, etc. None of this is obvious without having niche expertise in the XNU kernel.
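The fixed-point behavior described above can be sketched with a toy decomposition. It hard-codes a single Darwin-style prefix, /.nofollow/, purely for illustration. The real XNU rules (including /.vol and /.resolve) are more involved, and both function names are hypothetical.

```swift
/// Toy decomposition into an anchor and relative components.
/// Assumption: only "/.nofollow/" and "/" are recognized as anchors here.
func sketchDecompose(_ path: String) -> (anchor: String, components: [String]) {
  let anchor: String
  if path.hasPrefix("/.nofollow/") {
    anchor = "/.nofollow/"
  } else if path.hasPrefix("/") {
    anchor = "/"
  } else {
    anchor = ""
  }
  let rest = path.dropFirst(anchor.count)
  return (anchor, rest.split(separator: "/").map { String($0) })
}

/// Removes the last relative component; the anchor is never popped.
func poppingLastComponent(_ path: String) -> String {
  let parts = sketchDecompose(path)
  return parts.anchor + parts.components.dropLast().joined(separator: "/")
}

var path = "/.nofollow/foo/bar"
for _ in 0..<4 { path = poppingLastComponent(path) }
print(path)  // /.nofollow/
```

A naive splitter that treats .nofollow as an ordinary component would keep popping until it reached /, which silently changes what the path means.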

As for the existence of Anchor and suffix information: it is semantically relevant and part of FilePath's semantic model (e.g. ==). If FilePath were to internally parse paths correctly, but only expose the ComponentView (or even just push/pop components off the sequence of relative path components) and not expose Anchor or suffix information, then it would be hiding semantically-relevant information from the developer. E.g., in some domains, developers might want to strip a trailing slash, so we provide interfaces to do so. Similarly, resource forks are part of FilePath's semantic model (and naively look like relative path components even though they are not). Hiding their presence would force developers into using the C-bridging or byte-level interfaces to manually parse them and then do byte splicing and path reconstruction. Having dealt with the corner cases involved myself, I would not wish this fate even on my foes.

Will do. Thanks for pointing this out.

This is why FilePath and the stdlib should be responsible for correctly parsing paths. FilePath will stay up to date with changes in, e.g., the XNU kernel. It is unreasonable to expect developers to write their own parsers and keep them up to date.


Could you elaborate on this example or this need? I'm not sure I fully understand.

FilePath.ComponentView is RRC, so you can get the last relative path component, remove it, append a new one (or collection of components), etc. An earlier pitch had a slew of syntactic API directly on FilePath and those are deferred as future work to focus on the essential: parsing paths correctly on behalf of the developer without hiding semantically-relevant portions or bits.
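As a sketch of what the RRC conformance buys you, the generic helper below replaces the last element of any range-replaceable, bidirectional collection. An array of strings stands in for the component view here. replacingLast is a hypothetical helper, and whether ComponentView satisfies exactly these constraints is an assumption.

```swift
/// Replaces the last element of a collection, or appends if it is empty.
func replacingLast<C>(in components: C, with newLast: C.Element) -> C
where C: RangeReplaceableCollection & BidirectionalCollection {
  var copy = components
  if !copy.isEmpty { copy.removeLast() }
  copy.append(newLast)
  return copy
}

// An array of strings stands in for the relative components of a path.
let updated = replacingLast(in: ["usr", "local", "bin"], with: "lib")
print(updated)  // ["usr", "local", "lib"]
```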

We include non-syntactic resolve() because of the strong urging from the security community that, analogous to how parsing paths is surprisingly hard, resolving paths is surprisingly hard and bundling this with FilePath from day 1 is essential for security purposes.

I believe the RRC conformance gives you the tools necessary to implement syntactic operations, but it may be worth prioritizing (in a follow up proposal) a set of important syntactic operations or conveniences, such as joining paths, with defined semantics around all the corner cases that come up (corner cases that often can only be described in terms of Anchor).

Could you elaborate or describe what the ask is? If you're referring to the POSIX prefix for implementation-defined path syntax, e.g. how cygwin represents Windows paths, that's future work for the relevant platform.

I want to make sure we are not conflating multiple things here. \\.\ is the device namespace, to which Microsoft could certainly add things, but which has a well-defined parsing. The content before a separator is the device (part of the anchor); the content after is not. \\?\ designates a different path syntactic form (and lifts path length requirements if applicable, which is vital for many libraries) but still has a well-defined parse (noting that the equivalent of \\srv\shr is \\?\UNC\srv\shr). All of these are Win32 paths in that they go through Win32 API. FilePath does not model NT object-manager paths (e.g. \??\ and many other forms), which are something else entirely and do not go through Win32 API.
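To illustrate that these Win32 forms can be told apart by prefix alone, here is a rough classifier over plain strings. It is a simplification under stated assumptions: it matches prefixes case-sensitively, only handles backslashes, and does not model NT object-manager paths. The enum and function names are hypothetical and are not the proposed Anchor API.

```swift
/// Hypothetical classification of Win32 path prefixes.
enum SketchAnchorKind: Equatable {
  case verbatimUNC        // \\?\UNC\srv\shr
  case verbatim           // \\?\...
  case deviceNamespace    // \\.\...
  case unc                // \\srv\shr
  case drive(Character)   // C:\ or C:
  case currentDriveRoot   // \
  case relative
}

func classifyAnchor(_ path: String) -> SketchAnchorKind {
  // Order matters: longer prefixes are checked before shorter ones.
  if path.hasPrefix(#"\\?\UNC\"#) { return .verbatimUNC }
  if path.hasPrefix(#"\\?\"#) { return .verbatim }
  if path.hasPrefix(#"\\.\"#) { return .deviceNamespace }
  if path.hasPrefix(#"\\"#) { return .unc }
  let chars = Array(path.prefix(2))
  if chars.count == 2, chars[0].isASCII, chars[0].isLetter, chars[1] == ":" {
    return .drive(chars[0])
  }
  if path.hasPrefix(#"\"#) { return .currentDriveRoot }
  return .relative
}

print(classifyAnchor(#"\\?\UNC\srv\shr\file.txt"#))  // verbatimUNC
```

This only names the prefix form. Extracting a drive letter from something like `\\?\C:\` would need a further step after the prefix is recognized.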

2 Likes

That might be useful in the future, especially for systems-level wrappers or libraries, but I think that's too heavyweight for the stdlib at this point. It would also require pointer casting (or worse, .lazy.map-ing) to get at the numeric values directly. FWIW, swift-system has internal _SystemString which wraps an Array and _SystemChar which then wraps this numeric type, and there's a chance this could become public for syscall wrappers. I think that would be a different API that System adds to FilePath on top of what comes with the stdlib.

This would be future work as a sort of "named volume" concept. /.vol/xxx/yyy, \\srv\shr, C:\, etc., would all be explicitly named volumes (technically, explicit starting locations on named volumes). POSIX-ish systems also have / which is the root of an implicit unified namespace (i.e. the VFS). This could be a great future direction, possibly in a different module than Swift.

If the primary goal of driveLetter is for it to be something that users can easily format and render for users, would it make sense to just make this a String instead?

Granted, it's not as exact semantically as Character, but we don't have very many APIs outside of String itself in the standard library that vend Characters specifically. A single-code-unit String in Swift costs no more than a Character would, and there's a lot more you can do with a String immediately vs. a Character where some APIs might require you to stringify it first.

Since this is not a mutable property, we don't have to worry about validating the length of it; we can just document that the String always has a single character but that we use String for convenience.

1 Like

Its inline storage is much bigger, which is probably not a significant problem for the typical use cases of "drive letter" but is non-zero.

Unless I'm misunderstanding you, a Character is literally just a wrapper around a String. A single-character String and a Character have the same smol-string memory representation (unless there are any degenerate situations where a longer string is whittled down to a single character in a way that it retains a heap-allocated buffer).

UnicodeScalar seems likely to be the best option; it's smaller than both Character and String, representable by a literal, and is easily convertible to both CodeUnit and String.
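A small sketch of those conversions, using only existing standard library initializers. The conversions(of:) helper is a hypothetical name, and the UTF-16 value is optional because scalars above U+FFFF do not fit in a single code unit.

```swift
/// Converts a scalar to the other candidate representations.
func conversions(of scalar: Unicode.Scalar)
  -> (string: String, character: Character, utf16: UInt16?) {
  (String(scalar), Character(scalar), UInt16(exactly: scalar.value))
}

// A scalar can be written as a literal, just like a Character or String.
let drive: Unicode.Scalar = "C"
let converted = conversions(of: drive)
print(converted.string, converted.character, converted.utf16 as Any)

// Inline sizes of the three candidate types on the current platform.
print(
  MemoryLayout<Unicode.Scalar>.size,
  MemoryLayout<Character>.size,
  MemoryLayout<String>.size
)
```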

4 Likes

Oops, I was thinking of UnicodeScalar. Carry on!

I'm questioning whether you can do anything useful with "correct" path parsing. What useful functions will I be able to write on the properly-parsed FilePath that I can't write on a collection of naive components? Where "useful" means "a function that someone who understands paths would recommend using". The only one I can think of so far is "pop the last path component, including a ..namedfork suffix", and I'm not sure that niche use / abuse of syntax by macOS justifies increased complexity in the currency type.

(This is similar to my argument that String should not be a Collection, which I lost back in Swift 4.)

This is me forgetting the extended syntax for Windows paths, but I think the existence of \\?\UNC\srv\share remains an argument for "Microsoft could add a new kind of anchor next year". I mean, Microsoft well knows how much of a breaking change that would be, but it does mean we're signing up to deal with it forever.


All of this API surface makes sense for System, which is full of platform-specific, non-portable APIs. I'm not convinced it makes sense for the stdlib, which is nearly always quite uniform. Sure, you can successfully classify the parts of a path. What can you do with it? Not very much without additional knowledge.

2 Likes

Because there isn't enough pedantry here… parsing a path is easy. Interpreting a path is hard and may require I/O, network access, syscalls, and whatnot.

If we cannot offer basic path manipulation API in the initial version of FilePath without a lot of platform-specific elbow grease, then we shouldn't try to offer an inherently defective substitute. That does not preclude offering a bare-bones, C-friendly-ish FilePath type for representing a file path distinct from a string, but it does severely limit the utility of said type in the short term. And it doesn't preclude having System extend a stdlib type with platform-specific functionality like driveLetter or forkName or whatever.

I don't think I'm saying anything that hasn't already been said though, so feel free to ignore me. :melting_face:

1 Like

These particular VFS hacks might warrant special treatment since their handling is baked into the XNU kernel, and they have either a long tradition (like .vol) and/or security implications if they are treated ignorantly like regular path components (like .nofollow/.resolve). But these aren't the only VFS hacks in the wild (for example, ZFS volumes on Linux or FreeBSD resolve .zfs/snapshots/<id> relative to the volume mount point as a way of referring to saved older versions of the volume), and I could imagine OSes and file systems wanting to introduce more over time, so this does seem to put us in a position where we're signing up for being aware of OS-version-sensitive, or even file-system-sensitive, behavior. How do you see us handling those?

6 Likes

One comment I do have here is that IMO driveLetter is the wrong choice anyway. It should be deviceName and it should be some kind of string, at which point I think it's broadly portable (it's really only UNIX-like systems with a unified filesystem that don't have a device name, and most systems that do have it support relative paths without a device name anyway, so returning it as optional makes sense). While DOS only supports single character device names, the same isn't true of all DOS-like systems, and there are other platforms where devices were identified by name too. For instance:

| System | Example paths | Device names |
|---|---|---|
| DOS | C:\WORD\README.TXT, CON, LPT1 | C, CON, LPT1 |
| Atari TOS/MiNT | C:\DESKTOP.INF, U:\C\DESKTOP.INF, U:\MyDevice\ReadMe.txt, U:\bin\ls.ttp, CON:, AUX: | C, C, MyDevice, U, CON, AUX |
| CBM Amiga | Work:Projects, DEVS:Monitors/VGAOnly, RAM:MyFolder/ReadMe | Work, DEVS, RAM |
| MacOS (classic) | MacintoshHD:Work:To Do List | MacintoshHD |
| Acorn RISCOS | ADFS::Disc3.$.Work.Projects.ReadMe, Net::Server.&.Documents.Help | ADFS::Disc3, Net::Server |
| TOPS, VMS | DRIVE1:[WORK.PROJECTS]README.TXT | DRIVE1 |

The Atari TOS/MiNT example is an interesting one, because MiNT added a "unified" filesystem device, U:, under which all other devices are mounted; the idea here was to provide a more UNIX-like experience on top of a DOS-like base. In addition to devices, you can create links to directories in U:, so U:\bin is likely a symlink rather than a bin device.

Acorn's system is also unusual in that on Acorn platforms the path starts with the name of the filesystem driver you are using, in the examples above ADFS and Net respectively, but I think we can safely amalgamate these with the device name.

6 Likes

Arguably (because arguments), the device name on a UNIX-like system is the mount point of the device/volume.

Not that it would gel well with drive letters as a single abstraction.

Thanks for this, it's a great opportunity for us to clarify some things. We've been tossing around the term "syntactic" and there's actually a few distinct concepts/layers involved.

The kernel defines path syntax. That path syntax is parsed by the kernel prior to and regardless of what file systems in what configurations actually exist at what locations. Linux just uses vanilla POSIX path syntax. Darwin recognizes an additional set of prefixes (and resource fork suffixes). Windows has its own forms. FilePath's path decomposition follows how the kernel parses paths.

After parsing a path, the kernel has some information from the parse and some opaque blobs of bytes to send off to the file system driver for further handling. The kernel syntactically recognizes . and .. during traversal (hence Component.Kind), but otherwise components are opaque byte blobs. What those opaque blobs mean or what they do is outside the knowledge of the kernel and isn't universal on that platform: it depends on what file systems in what configurations are in what locations, which can even change while a process is running.

For .zfs/snapshots/<id>, the kernel parses it into a sequence of opaque byte blobs [.zfs, snapshots, <id>]. If the location resolves to a ZFS volume with snapshots enabled, those components reach the ZFS driver and get interpreted as a snapshot reference. On any other filesystem, they're just a regular path (that may or may not exist). The kernel doesn't distinguish .zfs from .foo at parse time and the meaning can change at runtime if filesystems get mounted or unmounted under the process. This is not true for /.nofollow/foo/bar on Darwin, as XNU parses and extracts meaning directly from the /.nofollow prefix before any file system driver is invoked (and in fact, .nofollow isn't even passed to the driver).
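The distinction can be sketched as a classification that only knows about . and .., with everything else treated as opaque. SketchComponentKind and kind(of:) are hypothetical names modeled loosely on the Component.Kind mentioned above, and strings are used instead of raw bytes for brevity.

```swift
/// Hypothetical stand-in for the kinds a kernel-level parse distinguishes.
enum SketchComponentKind {
  case currentDirectory   // "."
  case parentDirectory    // ".."
  case regular            // opaque to the parser
}

func kind(of component: String) -> SketchComponentKind {
  switch component {
  case ".": return .currentDirectory
  case "..": return .parentDirectory
  default: return .regular
  }
}

// Every piece of the ZFS snapshot path is just a regular component
// at parse time; any special meaning comes later from the driver.
let kinds = ".zfs/snapshots/<id>".split(separator: "/").map { kind(of: String($0)) }
print(kinds.allSatisfy { $0 == .regular })  // true
```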

There may be high-level application or framework conventions (e.g. app bundle internal layout) and those are outside the scope of FilePath decomposition. They likely belong in Foundation or higher.

"Syntactic" is a fuzzy/imperfect term. It's totally within the scope and convention presented here for future API to tell the developer that 3 as a resolve flag on Darwin means to not follow symlinks and not follow ... This is information parsed off and understood by the kernel regardless of file system concerns. Whether that is "syntactic" or "semantic" information is "just arguing semantics" if you will. Also, while FilePath as proposed treats the contents of components opaquely (like the kernel), there may be future convenience functions such as stem and extension that are "syntactic" in that they just look at the bytes without FS-specific knowledge, though whether they carry any true meaning is very much domain-specific (these have been deferred for future consideration).

We are intentionally deferring most non-syntactic operations due to atomicity concerns, as nearly anything involving paths and file systems gets really racy really fast. We probably want a different technical foundation for much of this than, say, paths. We are still proposing the resolve() primitive (discussed/debated earlier in this thread) as a temporal fence between the two layers.

I would say that it's probably the role of a different module than Swift and a different API design pattern to surface things like "assuming the file system is X in Y configuration, is Z applicable to this path". Whether that is expressed via extensions to FilePath, wrapper types, or however is outside the scope of my current thinking for this initial FilePath in the stdlib proposal.

On the kernel-parse layer evolving: Linux is trivial and unchanged. Windows's syntactic forms have been stable since Windows NT 3.1's release in 1993 (noting that the rules for interpreting legacy DOS device names shifted between Windows 10 and 11, but that's outside FilePath's decomposition). Darwin is the one platform where the kernel-parse layer has grown over time, but Apple ships the stdlib with XNU, so FilePath tracks kernel changes directly.

(As an informal aside, I am in communication with XNU maintainers and their intent is to not evolve the path syntax further, rather to build on top of the existing forms. Obviously I cannot speak on their behalf or make guarantees, but regardless, FilePath is very well positioned to co-evolve should the need arise.)

5 Likes