[API Review] FilePath Syntactic APIs

Foundation.URL sidesteps this nicely: rather than appending path A to path B, you resolve path B relative to path A. With that framing, the situation where B is absolute is less of a surprise. Unfortunately, I have no idea how to spell this as a mutating operation on A…but if someone comes up with a good name I think append is worth replacing.

4 Likes

I’m excited to have low-level Swift-native path operators!

Is there some situation where path.isAbsolute == path.isRelative?

If not, it makes more sense to me to keep only one of these (I would suggest isRelative since it’s the diminutive case), or — more clunkily — to have this return an enum { .absolute, .relative }.

It’s not Swift library or Cocoa standard to have two opposing flags for the same underlying value.

-Wil

2 Likes

Just want to make sure that you're aware of GitHub - mxcl/Path.swift: Delightful, robust, cross-platform and chainable file-pathing functions.

It does a very similar job, is also cross-platform, and has been through several iterations.

I think there is probably a lot of knowledge and experience which has been embedded in those updates.

3 Likes

Correct, Windows is interestingly different here. It has the notion of 26 volume roots (A - Z). Each one is a separate stack, and you explicitly change between them. However, the operation occurs on the appropriate volume. A change of directory operation is always rooted at the currently selected volume, but an absolute path (which includes the volume, as opposed to a root-relative path which is not absolute) performs the operation on the specified volume without changing the volume selection.

I can see see why it may be confusing for some to append a "/bin" to concatenate the string literal, and I can see the use of it too - you may have a path which you are not guaranteed is path separator delimited and wish to append a component to it.

I feel like the current behaviour makes sense, though I could also see an argument for appending an absolute path to an existing path to be an invalid operation and perhaps we should make this operation throw in that case. (Seems that @finestructure made the same comment)

1 Like

Relative paths are an interesting subset of paths (not covered by this initial proposal). There are operations which could benefit from a more thorough modelling of the behaviour of relative paths. By keeping both here, we have the opportunity to extend this in the future with a richer vocabulary for working with relative paths. In fact, if you look at swift-tools-support-core, you will find that it found it useful to distinguish AbsolutePath and RelativePath.

This "duplication" is an expression that relative paths are relative paths rather than paths which are not absolute.

1 Like

Maybe .appendRelative(...) or .append(relative: ...)?

Although it should probably still throw if the path is absolute.

1 Like

In general I think this looks really great! I haven’t had time to dive really deeply into it yet, so I apologise for this in advance if it’s addressed, but I do have a specific feature I’d like us to consider adding here.

Path traversal attacks are one of the most common logical security flaws in software. They regularly affect server code, and also sometimes crop up in somewhat unexpected places elsewhere. These are usually triggered by resolving a file path derived from remote input and making one of two mistakes:

  1. Not validating that the resolved path is within one or more valid directories, or;
  2. Not normalizing the resolved path in the way that the OS will, e.g. forgetting to collapse .. or mishandling drive letters

While we can’t defend against users forgetting to do this, we can make it easier for users to avoid these pitfalls by having a FilePath.contains(otherPath:) or FilePath.isContainedWithin(parent:) or some similar method. This would automatically normalize the paths and then perform the appropriate comparison. This would make it much easier for programmers to defend against this surprisingly common bug form in a genuinely effective way.

9 Likes

It's a bit hard challange for me the Swift language

I like the idea of throwing if we keep the API as-is. If we can avoid that with a different API, that would be best, especially for any future APIs that interact with the file system. C+17 has those operations as free functions in the filesystem namespace, while Rust has them methods on Path. If we do end up putting those operations directly on FilePath as methods, it would be nice if throws was a syntactic hint that an operation consults the filesystem (in addition to being called out in the documentation).

But, I do like the idea and also like calling it something other than append. I went with append because Swift uses it for similar operations (albeit over homogeneous collections) and there was precedent in C++17.

Beyond broad precedent in other languages, I think we do want some operation that has this cd-liked behavior. The operation models the path as a stack that we push batches of values onto, where there's a special sentinel value that clears the stack prior to pushing the rest of the batch. I'm leaning heavily towards a name that has "push" in it, like Rust's. Hopefully a better name there will avoid issues with "combine" or "join" in other languages, for which there is significant user confusion.

  public mutating func push<C: Collection>(_ c: C) where C.Element == Component

If we can split root off from Component, then there is no issue -- we've hoisted the "special sentinel value" consideration into the programmatic level (users explicitly clear the path prior to a push). If we cannot, then at least there will be a deliberate choice between components and relativeComponents (which was future work, but could be justified if it eases this problem).

Another observation is that this behavior seems to be the most surprising when using strings or string literals. I wonder if we should have an overload with looser semantics for a String that will ignore a leading slash.

3 Likes

Yes, I also saw this in Java, though the word "resolve" often carries connotations of resolving symlinks as in pathlib.

4 Likes

That was definitely considered and debated. I think it depends on how firmly reified the opposing values are in the given domain. On one extreme, use of the word "not" is a strong hint that you should not name the opposing values (e.g. isNotEmpty would be avoided).

I feel like relative and absolute paths are strongly established concepts on their own. Even though they can be defined as "not absolute" and "not relative", they tend to be used and documented with this established terminology. It's also not always clear which should be the canonical name, e.g. FilePath.relativePath and a ComponentView.relativeComponents exists.

(This is all under the assumption that "absolute" means "fully-qualified" on Windows, otherwise they are not opposites.)

For example, the components of a relative path can be treated generically as a RangeReplaceableCollection as it is effectively homogenous, which is not true of absolute paths. And again, we'd have to be explicit about Windows roots (which can be present on a relative path); such a type would just be the components beyond the root.

2 Likes

I am definitely aware of that! It will be more relevant as a comparison when we tackle APIs that interact with the file system.

3 Likes

I am very interested in guarding against this and any way that we (reasonably) can. For example, we normalize separators on construction which includes dropping any trailing separator. If that means rsync needs to model different behavior by adding a Bool rather than munging it into the path representation, that's for the better.

(Similarly, it's worth it even if it makes super sketchy Windows practices involving directories named after DOS devices more explicit).

Beyond that, we are very explicit what normalizations are lexical (i.e. do not consult the file system) and which are not (those are future work).

Would this be a lexical normalization or would this be realpath? There are multiple levels of "normality" that might be desired by a tool (taken from the gist):

5 Likes

Generally it is sufficient to use lexical normalization. OWASP's guidance uses the Java normalize() function as an example of something to use to combat path traversal, and that's really what we're shooting for.

The presence of a symlink is always going to be a bit of a wrinkle, but in general while symlinks do allow escaping from the specific directory tree they don't allow arbitrary path traversal. They glue a different directory tree into this one. You're only at risk if for some reason you symlinked / into your directory tree, and that's a pretty bad idea to begin with.

This is definitely not sufficient to fully protect oneself, but it's a good tradeoff of performance (no syscalls!) and safety. If we make it easy to do, we can incentivise users to actually do it.

6 Likes

I know you're exploring other options as well, but throwing on append strikes me as a little heavy handed. If appending a rooted path is disallowed, what about returning a boolean like SetAlgebra.insert(_:) and Collection.formIndex(_:offsetBy:limitedBy)?

So this would be a "do nothing, return false" variant.

1 Like

Ya I think I'm just voting that throwing feels like a disproportionate response for a special case of a syntactic operation. Do we have any guildelines written down for when to throw errors vs return an Optional/Bool?

Update: I'm working on a new prototype and some API changes that explore a bolder, but I think cleaner, solution.

  • Root is a separate type from Component, i.e. components are non-root folder/file names
    • This also makes it clearer for future Windows root parsing/analysis APIs
  • FilePath.inits taking components now have an optional root parameter with argument label.
    • In my use so far, these more explicit initializers seem to improve clarity.
    • Also adding FilePath.init(root: Root?, components: Component...)
  • FilePath.ComponentView is similarly a view of the relative portion
    • It now conforms to RangeReplaceableCollection for less common but useful things like insertion-in-the-middle
    • It no longer has root, relativeComponents, or dirname, as roots are irrelevant to this view
    • It no longer has basename, which is now just .last
  • FilePath.components might be renamed to FilePath.relativeComponents, to avoid confusion
  • FilePath.Component has a Kind enum and we drop isParent/CurrentDirectory()
  • append/pushLast overhaul:
    • pushLast(_:Component) is renamed to append(_:Component), which cannot be a root
    • We add append(_:String) which can treat a leading slash as a separator instead of a root, if applicable
    • Previous append(_:FilePath) is renamed to FilePath.push(_:FilePath)
      • This implements the cd-like stack semantics, where pushing a root clears anything before it in the stack.
  • Some renames: stripPrefix -> removePrefix, popLast -> removeLast

Some questions I'm considering:

  • Should FilePath.Component be renamed FilePath.RelativeComponent?
    • That seems onerous: there's nothing "absolute" about components themselves. There's more so the question of root-or-not.
  • Should ComponentView be RelativeComponentView?
    • It's rare to type the name itself and this is a view of the components of the relative portion
  • Better name for FilePath.push()'s cd-like, file system navigating semantics?
    • navigate(to:) sounds like something is happening beyond a syntactic modification of the path
  • We're going with lexicallyNormalize() and isLexicallyNormal, which mirrors other words like canonicalize() and isCanonical, but there could be an argument for isLexicallyNormalized.
  • What happens to componentStrings?
    • It's convenient (even more so when root is separated out), but is the name confusing?
    • Alternatively, it could be dropped or deferred.
  • The dirname / basename naming debate
    • basename is now simply relativeComponents.last, but makes sense to keep if we keep dirname
    • dirname has precise semantics, while something like parent would mislead (worse) for a path like /bin/ls/..
    • Swiftifying the name to directoryName would establish a false intuition that it is a directory, or even the directory immediately above (bin in usr/bin/ls) and would put distance from its precise semantics
  • Non-global-namespace for PlatformChar and PlatformUnicodeEncoding
    • Don't necessarily want to "steal" the name "Platform" from downstream code, but it makes sense for System
    • We're likely to add a enum CTypes {} soon anyways, so we could host them there.
  • Consider adding lexicallyContains or similar as @lukasa requested
  • Ability to construct Windows-style paths on Unix and vice-versa
    • This would be useful for scripts that output paths as text for different platforms
    • While these could be print methods, they wouldn't allow for Windows root behavior on Unix
      • E.g. \\server\share\ is just a root, so removeLast would return false on Windows
    • Internally, we test by having the choice be contextual, driven by thread-local storage. This has some problems if we make it general use:
      • Thread local storage would be incompatible with task-based concurrency
      • Even if we use (proposed future) task-local storage, we have to worry about escaping paths
    • Storing a bit on FilePath complicates initialization and conversion
      • E.g. what should a string literal use?
9 Likes

I very much like these changes. The whole API is impressive so far, and separating the FilePath.Root concept into its own type makes a lot of sense to me. I also like the rename of popLast, since pop to me implies removing and returning a portion of the collection.

Has hosting these in FilePath been considered? FilePath.PlatformChar is a bit wordy in a vacuum, but makes sense to me as "the smallest unit of a file-path string on the current platform that I can type on a keyboard." FilePath.PlatformUnicodeEncoding is similarly descriptive. Does Windows or Unix use different unicode encodings for different contexts? If not, is it possible that Swift may in future support a platform that does? Nesting these under FilePath expresses the file-path context for these abstractions now and in the future.

We might rename these to FilePath.Char and FilePath.UnicodeEncoding for brevity. Similar to how Swift's Int type has an architecture-dependent size, FilePath.Char would have a platform-dependent underlying type, and therefore size. FilePath.UnicodeEncoding seems also to fit that paradigm to me.

4 Likes