[API Review] FilePath Syntactic APIs - Version 2

typesanitizer · January 29, 2021, 1:24am

I apologize if I missed it when reading the proposal, but what is the reason for allowing an empty FilePath to be constructed? I see you mentioned

in a comment, is achieving that the reason empty FilePaths are allowed?

I ask this because it seems like one could forget to check for the isEmpty case after using certain functions (such as when using removingLastComponent()) and hit a problem later.

Michael_Ilseman · January 29, 2021, 1:59am

Non-emptiness is not really an invariant we can enforce in a developer-friendly fashion. ComponentView is a RangeReplaceableCollection, meaning that a path that doesn't have a root could become empty as the result of any number of generic (i.e. indifferent to it having been a path) operations. So would root's setter, etc. If we try to enforce the invariant by trapping (which is what the other invariants do), then we place the burden of predicting whether an operation might result in an empty FilePath on any call site to a mutating method or any generic algorithm you might want to run against the ComponentView.

I feel like normalizing empty paths to . would produce all kinds of weirdness for any kind of programmatic treatment, while the empty path is much more of an identity than . is.

What kinds of problems are you envisioning? FilePath.isEmpty is not necessarily relevant to removingLastComponent(), as that will preserve any root, but FilePath.components.isEmpty would be.
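To make the distinction concrete, here is a minimal sketch using a toy stand-in type. `ToyPath` and its members are hypothetical and only model "optional root plus components"; this is not the System implementation.

```swift
// Toy stand-in for FilePath: an optional root plus relative components.
// `ToyPath` and its members are hypothetical, not the System API.
struct ToyPath {
    var root: String?          // e.g. "/" or nil for a relative path
    var components: [String]   // e.g. ["usr", "bin"]

    // Empty only when there is neither a root nor any component.
    var isEmpty: Bool { root == nil && components.isEmpty }

    // Drops the last component (if any) and always keeps the root.
    func removingLastComponent() -> ToyPath {
        var copy = self
        if !copy.components.isEmpty { copy.components.removeLast() }
        return copy
    }
}

let absolute = ToyPath(root: "/", components: ["foo"]).removingLastComponent()
let relative = ToyPath(root: nil, components: ["foo"]).removingLastComponent()

// The rooted path still has its root, so only `components` is empty.
print(absolute.isEmpty, absolute.components.isEmpty)  // false true
// The relative path has nothing left at all.
print(relative.isEmpty, relative.components.isEmpty)  // true true

// A generic mutation that knows nothing about paths can also empty it.
var generic = ToyPath(root: nil, components: ["a", "b"])
generic.components.removeAll(where: { _ in true })
print(generic.isEmpty)  // true
```

In this model, trapping on emptiness would mean that the last call, a plain collection algorithm, could crash depending on the data it was given.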

typesanitizer · January 29, 2021, 2:37am

I didn't realize this before, that makes sense.

I was thinking about this example without the root: foo => "". In this case, it seems possible that the behavior a developer wanted was "get the name of the directory this is in", but if those are the semantics they wanted, it's likely their code has an error and they shouldn't have called removingLastComponent in the first place (or alternatively, they should remember to check for the FilePath.isEmpty case later).

wes1 · January 30, 2021, 6:04am

Hi everyone,

New here - Thank you for operating in the open, and for the clear proposal!

Minor Kind placement/naming/case suggestions...

Given the proposed Component and Kind:

public struct Component {
  public enum Kind {
    case currentDirectory
    case parentDirectory
    case regular
  }
  public var kind: Kind { get }
  public var isSpecialDirectory: Bool { get }
}

Should isSpecialDirectory move inside Kind?

Filters/closures could then target (immutable) kind instead of (mutable?) component. I prefer knowing isSpecialDirectory only takes a kind to calculate. (Users might expect Component.isSpecialDirectory to detect other "special" directories like .local by inspecting the name.) Also I can imagine other Kind predicates for later, and wouldn't want to clutter the Component namespace.

As for the names: "special" and "regular" feel vague. Current and parent directory may seem to refer to the directories rather than the referent. I had to read the comment to see they mean . and ..

This Kind really refers to the form of Component text; perhaps Syntax? The code captures very literal/specific conventions about the actual name (not the reference).

How about:

public enum Syntax {
  case name
  case dotName
  case dotSameDir
  case dotParentDir
  public var isDotDir: Bool
}

Awkward, yes -- but clear, no?

dotName is useful; e.g., you use it to implement the invariant that stem is never empty:

(kind == .dotName) <==> special stem rule

And perhaps case nameDotExt? This also helps stemming and might help e.g., a SwiftPM scanner filter fast-fail to avoid inspecting component text:

scan(...) { c in c.kind == .nameDotExt && ... }

Thanks,

Wes

P.S. - Yes, I accept that Directory is considered clearer, but dir feels more syntaxy and I doubt it confuses anyone, so I threw it in :)

Michael_Ilseman · February 1, 2021, 7:55pm

This is an interesting suggestion, especially as we add more component analysis/parsing APIs (things like hidden files, multiple extensions, or even resource forks)

Currently, the API is immutable and Swift is very helpful about making sure you don't call a mutating method on a let. If Swift later adds mutable borrows, such that you could iterate-and-mutate components, then we might add mutating methods on Component.

Since it's a single API that can be serviced entirely with a == on the Kind enum, I'd say we drop isSpecialDirectory for now to avoid confusion. We can add it back in if/when we have additional queries.

At some point, System as a systems-programming library needs to directly represent systems concepts. Path components (file-system-agnostic) have 2 special kinds with specific meanings, and otherwise they are handled by application or file-system logic.

I am interested in any improvements to naming. I went with regular as less-bad than normal (which carries normality and other connotations). I think name as an alternative to regular is interesting, though I think antonyms to special are semantically clearer to serve as "the kind of path component". component.kind == .regular is a little clearer about excluding special directory components than component.kind == .name.

I'm not sure I understand what you're referring to. Can you elaborate?

We will want to add component analysis APIs for application-specific processing, but we definitely should not conflate those with the component kind. System is shipped on Darwin as an ABI-stable library, meaning that enum cases are permanently closed. open enums are even more vague and scary to a developer, as there is no way to program with them exhaustively.

I'm not sure what you mean. Hidden files can have extensions. These concepts are independent/composable and probably should not be expressed exhaustively.

Then we really, really do not want to conflate semantic kind with syntactic info in an ABI-stable closed enum. What if we want to have a multiple-extension form? What if we later want to add alternate data stream support?

The high-order aspect of processing a component is recognizing the "kind" (I'm at a loss for a better word here) of component, for the purposes of exhaustively including or excluding . or .., which have defined semantic meaning.

For the components that are not . or .., aka "regular" here (again, at a loss for a better word), there can be further domain-specific refinement of behavior. This doesn't have to be presented with exhaustivity.
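A rough sketch of that layering, assuming a stand-alone `Kind` enum that mirrors the proposed cases. The `kind(of:)` and `isHidden` helpers are hypothetical, and the leading-dot rule for hidden files is an application convention chosen only for illustration.

```swift
// Hypothetical mirror of the proposed Component.Kind; not the System API.
enum Kind {
    case currentDirectory   // "."
    case parentDirectory    // ".."
    case regular            // everything else
}

// Classifies a single component's text into one of the three kinds.
func kind(of component: String) -> Kind {
    switch component {
    case ".":  return .currentDirectory
    case "..": return .parentDirectory
    default:   return .regular
    }
}

// Domain-specific refinement layered on top of `.regular`. The leading-dot
// rule for hidden files is an assumed application convention.
func isHidden(_ component: String) -> Bool {
    kind(of: component) == .regular && component.hasPrefix(".")
}

let parts = ["..", ".", ".gitignore", "Sources", "main.swift"]

// What `isSpecialDirectory` would have answered is a plain comparison.
let regular = parts.filter { kind(of: $0) == .regular }
print(regular)                 // [".gitignore", "Sources", "main.swift"]
print(parts.filter(isHidden))  // [".gitignore"]
```

The closed enum stays exhaustive over the three cases, while predicates such as `isHidden` can be added or changed without touching it.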

wes1 · February 3, 2021, 4:26am

Hi -

Thanks; I think I'm beginning to understand your concerns/constraints:

Agreed, so:

. and .. are relevant
.name* and name.ext are targeted cases, but perhaps not enums

I realize I was attracted to an idea of Path as lexical [Component], distinct from semantic FilePath. This idea is likely out of scope now.

For those interested...

Reading the proposal I really liked the distinction between Root and Path. But the discussion was troubled by the awkwardness around lexical v. non-lexical, absolute and relative, escaping roots, etc. that to me seemed to be driven by collapsing all the functionality onto FilePath, instead of treating var components : [Component] as a general-purpose Path, with only syntax/lexical operations.

Windows root strings may have prompted the creation of the Root/Path distinction, but I saw it in light of Java's move from File to (Path plus Filesystem), i.e., roughly to logical config plus physical validation and construction. That seemed helpful from the tools perspective.

Let's say Path handles only lexical/logical operations on segmented text and Filesystem handles physical file character/name validation, link detection, etc. for a given volume.

Users may create relative paths, which can be converted to absolute coordinates using some basis like the "current" directory.

Basis can have a Path relative to a Filesystem root. It captures the Windows cases of multiple drives, different CWD on each drive, etc. and unixy volumes and mount points, and would perhaps support chroot/jail or testing FilePath equivalency by detecting when different root specifications refer to the same place.

Overall:

... FilePath {
    var basis : Basis
    var path : Path 
}

... Basis {
    var path : Path 
    var root : FileSystem
}

... Path {
    var segments : [Segment]  // originally [Component]
}

FilePath, then, coordinates a Path and a Basis (and its Filesystem) for path modification, characterization, normalization, validation, proliferation, etc. FilePath API could just publish variables path and basis, or it could add operations for convenience, consistency, status, etc. I assume Path is public, shared, and immutable.

Some outcomes of this world:

(1) Path replaces FilePath's lexicallyResolve, lexicallyNormalize, and removePrefix:

... Path {
 func resolve(_ path: Path, _ within: Bool = true) -> Path?
 func normalize() -> Path?
 func relativize(_ path: Path) -> Path?
}

(For name removePrefix, relativize is borrowed from Java Path.relativize(..)):

For any two normalized paths p and q, where q does not have
a root component, p.relativize(p.resolve(q)).equals(q)

Then e.g., FilePath.resolve(..) would handle links and character/name validation
(delegating to Basis/Filesystem), optionally preventing escape.

(2) There's no confusion about replacing roots since path operations only take Path.

(3) To "rebase", create a FilePath from the Path with another basis.

Tools or System?

But Path has other (non-File) uses that push it towards a top-level type, possibly beyond the goals of the System package (or this revision).

Path also works as URI path component, archive internal paths, property paths, etc. This commonality makes it easy for users e.g., to calculate the expected filesystem path of an exploded archive or server URL; changing schema is changing basis.

Also, tools often use path expressions (e.g., .gitignore **/skip/, XPath /Author//Book/@title). For that I believe we need just two enum cases:

public enum NameSyntax { // AKA Component.Kind
  case empty
  case expression
  ...
}

These aren't useful for system calls, but might support tools making those calls.

Path would not detect expression; an interpreter would recognize and
set the component/segment kind (whoops: mutable, or schema Path factories).

Path would detect the empty case and might want to even for FilePath purposes. Currently to avoid file bugs FilePath normalizes '/' on initialization (removing trailing / and n>1 / in series) -- even though it waits to normalize . and ...

Supporting empty would enable Path to

adopt input without mutation (i.e., as a "shared value"?), with possibly-lazy parsing
support round trips between a path and its input
define absolute paths: var isAbsolute: Bool { segments[0].syntax == .empty }
handle XPath /Author//Book expressions
be used in Basis for //server expressions

Having Path immutable/shared/lazy makes it less burdensome for combining operations like resolve, relativize, rebase, et al to take Path parameters instead of String, likely simplifying operations and consolidating any conversions in initialization.

Having multiple tool consumers of Path drives kind further in the direction of (input) syntax buckets for (system/tool) semantics. Getting rid of "current", "parent", "directory", etc.:

public enum NameSyntax {
  case empty      // ""
  case dot        // "."
  case dotDot     // ".."
  case dotName    // ".name*" including .name.ext
  case nameExt    // "name*.ext" (not "name.")
  case name       // "name" or "name."
  case expression // per tool
}

In any case, I recognize this FilePath/Path split and scope creep is likely late and out of bounds now, and there's no need to discuss. Thanks for your patience!

Ponyboy47 · February 6, 2021, 6:23pm

I finally took the time to read through the full 2nd version proposal text and I'm really liking the changes you've made!

I like how you've split Root out from Components and handle those separately.

I'm also a fan of the new lastComponent and removingLastComponent in lieu of basename and dirname. I think those names feel more swifty to me.

I do have some concerns about the lexical operations. I understand their importance, but I am curious how the filesystem versions of those operations are going to end up looking and if most people will even know about or use the lexical operations once they can normalize against the filesystem. Both kinds of operations are important and each has their own benefits, but I worry that once the filesystem APIs are in place, very few people will ever look at the lexical operations.

Have you thought at all how the API would look for performing operations against the filesystem? Will there just be a handful of operations on FilePath that consult the filesystem or will there be some FileSystem object like the current FileManager which is used to communicate with the filesystem? It may be confusing for people if the two kinds of normalizations are found in separate places, although I think the docs for the lexical operations are clear enough about what they do and how they behave.

I think that at the end of the day, most people will prefer to normalize paths against the actual filesystem (since that's kind of the point of a file path). I just want to ensure that both kinds of normalization compose well in relation to each other.

I keep going back and forth in my head about whether I like having both push and append or not. I like how append ignores the root and just "appends". I think this behavior is clear, expected, and should reduce bugs (coming from the Python world where I've been bitten by "appending" absolute paths). I personally have never needed the behavior of push and I am curious to see how common it is and if it's really worth including the API at this level. It seems easy enough for anyone who needs that behavior to implement it themselves.

My concern is that people will begin using these APIs, have FilePaths all over their codebase, and then reach for append, find out that it doesn't work with a FilePath, call push instead, and end up with unexpected bugs, just like you would in Python or other languages.

I see your rationale about how it is more acceptable to ignore roots from a String than from a FilePath (and I agree), but I personally think that at the end of the day most people are going to reach for the strongly-typed FilePath and end up using push anyways.

I don't really have a better solution in mind, I just wish there were a way to get the best of both worlds to help newcomers avoid these weird kinds of bugs.

This makes me curious about the array/set remove operations. Do those also trigger COW copies and if so, why not just return the removed component to match their behavior?

If the result is marked as @discardableResult will a COW copy be triggered even when the result is not used?

+1 to the CInterop enum/namespace

Glad to see this is just deferred. I still hope to someday see type-safe absolute/relative paths but I agree it is a problem beyond the scope of these syntactic APIs.

All in all I think these APIs are coming together well! Can't wait to see them fully implemented and accepted into the System library.

Michael_Ilseman · February 10, 2021, 3:50pm

We haven't tackled how the file system touching operations will end up looking yet. C++17 does free functions within the filesystem namespace while Rust has them hosted directly on the Path type. While I do like the delineation that C++17 has, we're more likely to put them on FilePath as that fits Swift's feature set better and is more chainable. Such operations would be throws since they touch the file system and the fact that they consult the file system will be called out and highlighted in documentation.

The two formulations are semantically equivalent and neither has better or worse support for in-memory mocking (e.g. for testing). Mocking is yet another topic for a different discussion.

The proposed lexical operations have "lexically" in the name to carefully distinguish them from future alternatives that consult the file system. This means that the shorter names are available for the more common or more correct/safe alternatives. If we go with a different scheme and want to shorten the lexical operation names, going from a verbose name to a simpler one is a gentle deprecation.

Lexical operations are useful in a context where consulting the file system would have unacceptable performance issues, such as a build system or a server. In that case, the user (who is more of an expert) is carefully checking the docs to determine which operations consult the file system and which don't. The docs for the non-lexical operations can link to the lexical ones (and vice versa). We also have the throws hint that something beyond the library itself is consulted. The user would need to learn that the term for that behavior is "lexical" in order to naturally find the right APIs.

There is still the potential for confusion, but I think it is lower given the choice of names and types.

We have differently named methods that take different argument types, which helps make the distinction. append is used for the standard Swift operation that produces results equivalent to a homogeneous collection append operation. Users see that append takes a collection of components, and likely know that paths have components, so it isn't too far of a leap to go to

path.append(other) -> path.append(other.components)

Making the leap of path.append(other) -> path.push(other) is invoking a differently named method, push, which does not have Swift connotations of homogeneous collection appending. Swift prefers overloading rather than adding a completely different method name, so it isn't very common to switch method names with the expectation that the semantics are the same (and there are overloads for append already).

At some point, a user does need to read the doc comments for the methods they call if they have non-standard Swift names. They will eventually have to learn that paths can have roots. I think having the name and type separation helps point them in that direction. If there's a name that's clearer than push, then I'm open to that as well.
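A sketch of the difference on a toy type. `RootedPath` is hypothetical, and the push rule shown here (a rooted argument replaces the receiver, a relative one is appended) is the cd-like behavior assumed from this discussion rather than the actual implementation.

```swift
// Toy model; `RootedPath` is an illustrative stand-in, not the System API.
struct RootedPath: Equatable {
    var root: String?
    var components: [String]

    // Collection-style append: adds components, never touches the root.
    mutating func append(_ newComponents: [String]) {
        components.append(contentsOf: newComponents)
    }

    // cd-like push (assumed semantics): a rooted argument replaces the
    // whole path, a relative argument is appended.
    mutating func push(_ other: RootedPath) {
        if other.root != nil {
            self = other
        } else {
            components.append(contentsOf: other.components)
        }
    }
}

let other = RootedPath(root: "/", components: ["etc", "hosts"])

var appended = RootedPath(root: "/", components: ["tmp"])
appended.append(other.components)   // the root of `other` is never consulted
print(appended.components)          // ["tmp", "etc", "hosts"]

var pushed = RootedPath(root: "/", components: ["tmp"])
pushed.push(other)                  // `other` is rooted, so it wins
print(pushed.components)            // ["etc", "hosts"]
```

Writing `other.components` at the call site makes it visible that the root is being dropped on purpose.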

No, they do not. Array and Set physically store their elements, while Component itself is not actually stored. It is a slice of the FilePath's byte storage and created by subscript. Since Component is a safe type, it has a strong reference to the storage. Removing a component will delete the bytes that it is a slice of and if the component is still alive, a copy of those bytes has to happen.

This isn't true of Array or Set, except for the bizarre case where their element type is a slice of themselves (in which case, mutation would invalidate the underlying indices anyways, making this a moot scenario).
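The mechanism can be sketched with standard-library pieces only. `Storage`, `SliceComponent`, and `SlicePath` are hypothetical names, the `copies` counter exists only to make the copy observable, and the toy ignores roots for brevity; the real layout in System may well differ.

```swift
// Shared byte storage; a class so that references to it can be counted.
final class Storage {
    var bytes: [UInt8]
    init(_ bytes: [UInt8]) { self.bytes = bytes }
}

// A component is a range into storage that it keeps alive (hypothetical).
struct SliceComponent {
    let storage: Storage
    let range: Range<Int>
    var text: String { String(decoding: storage.bytes[range], as: UTF8.self) }
}

struct SlicePath {
    private(set) var storage: Storage
    private(set) var copies = 0   // counts copies, for demonstration only

    init(_ text: String) { storage = Storage(Array(text.utf8)) }

    // Range of the bytes after the last "/" (the whole path if none).
    private var lastRange: Range<Int> {
        let bytes = storage.bytes
        let start = bytes.lastIndex(of: UInt8(ascii: "/")).map { $0 + 1 } ?? 0
        return start..<bytes.count
    }

    var lastComponent: SliceComponent {
        SliceComponent(storage: storage, range: lastRange)
    }

    mutating func removeLastComponent() {
        let cut = lastRange.lowerBound
        // Copy-on-write: a live component still references the storage.
        if !isKnownUniquelyReferenced(&storage) {
            storage = Storage(storage.bytes)
            copies += 1
        }
        // Drop the component together with its preceding separator.
        storage.bytes.removeSubrange(max(cut - 1, 0)...)
    }
}

var path = SlicePath("/tmp/what/ever.txt")
path.removeLastComponent()          // no component alive: mutate in place
print(path.copies)                  // 0

let filename = path.lastComponent   // shares storage with `path`
path.removeLastComponent()          // storage is shared: copy first
print(path.copies, filename.text)   // 1 what
```

In this sketch the copy happens at the mutation, not when the component is created, and only if the component is still alive at that point.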

@discardableResult suppresses the compiler warning, but does not change how the library is compiled into a binary. System is separately compiled from user code so the return value is still returned in a register. (These could be made @inlinable, but then we would choose a different name, as removeLast even in Array/Set doesn't return the removed argument. We'd probably go with popLast).

Avi · February 10, 2021, 3:57pm

What about replace(with:)? I think push also implies a pop operation, which of course does not apply.

Ponyboy47 · February 10, 2021, 8:36pm

I look forward to the proposal for that functionality! I'd like to do some exploration with strongly typed open paths (eg: opening a FilePath to a file gets you File, while a directory gets you Directory, or Socket, Character, Block, etc). Some of the opened file descriptor logic is sharable but each file type has its own specific API that I think we could illustrate well with Swift.

I like that the shorter names will be more easily discoverable, understandable, and usable.

Good to know what those contexts are for using lexical operations. Thanks!

So that makes a FilePath more like a String and the FilePath.Component is basically a Substring?

How does that work when this pattern is used:

var path = FilePath("/tmp/what/ever.txt")
let filename = path.lastComponent
path.removeLastComponent()

Does that mean the filename variable will introduce a copy? If so, when would that copy happen? I looked at the implementation of FilePath.components but am not fully understanding it.

Gotcha.

Ponyboy47 · February 10, 2021, 8:42pm

I think the behavior for a function named replace would be confusing in this scenario:

var path = FilePath("/tmp")
path.replace(with: "dir/file.txt") // path is "/tmp/dir/file.txt"

I don't really have any better suggestions for the push function name though.

xwu · February 10, 2021, 9:07pm

I've not had the occasion to use pushd on the command line and I don't think users will confuse push with that functionality, but nonetheless, I think we have an opportunity to hew a little more closely to familiar terminology: as you describe, the method is meant to have cd-like semantics, so why not name it something like change(to:)--or if one would like to be even more explicit, changeDirectory(to:).

Ponyboy47 · February 10, 2021, 10:01pm

pushd and cd behave basically the same, but pushd maintains a stack of previously pushed directories so that you can popd up the stack to return to previous locations.

EDIT: cd does store the previous directory which can be navigated to using cd -:

pwd # /home/myuser
cd /tmp # Now in /tmp
cd - # Now in /home/myuser
cd - # Now in /tmp

Given how FilePath does not have a pop method or maintain any sort of stack information it may be more clear to use an alternative name to push.

I do really like the change(to:) name. I think changeDirectory(to:) sounds to me like it could result in the filename staying the same but the parent directory changing.

Michael_Ilseman · February 11, 2021, 12:19am

I'm not sure what we're changing from or to with that name, but I do like the connotations from cd. An alternative I was considering was navigate(to:), but I want to avoid implying that there is any kind of effect happening, such as changing the current working directory.

stevapple · March 6, 2021, 5:34pm

Looking forward to this API set. The current Path implementation in swift-tools-support-core is really messy… FilePath is world-saving for Swift projects (especially on Windows).

stevapple · March 7, 2021, 3:07am

A problem here: why does this return true? I would expect "/usr/bin/ls///" to be normalized to "/usr/bin/ls/", which is by no means a prefix of "/usr/bin/ls". Exchanging the two also returns true.

Michael_Ilseman · March 7, 2021, 3:08am

Trailing slashes are stripped as part of normalization.
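As a rough illustration, assuming a simplified Unix-only normalization in which splitting on "/" drops repeated and trailing separators. The helper names are hypothetical and this is not the System implementation:

```swift
// Simplified, Unix-only normalization assumed for illustration: splitting on
// "/" drops repeated and trailing separators; the root is tracked separately.
func normalizedComponents(_ path: String) -> (isRooted: Bool, parts: [Substring]) {
    return (path.hasPrefix("/"), path.split(separator: "/"))
}

// Component-wise prefix test on the normalized forms.
func starts(_ path: String, with prefix: String) -> Bool {
    let (pathRooted, pathParts) = normalizedComponents(path)
    let (prefixRooted, prefixParts) = normalizedComponents(prefix)
    return pathRooted == prefixRooted && pathParts.starts(with: prefixParts)
}

print(starts("/usr/bin/ls", with: "/usr/bin/ls///"))  // true
print(starts("/usr/bin/ls///", with: "/usr/bin/ls"))  // true
print(starts("/usr/bin/ls", with: "/usr/bi"))         // false
```

Both orderings compare equal component lists, which is why swapping the arguments does not change the result, while a partial component name is not a prefix.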

stevapple · March 7, 2021, 3:21am

I’m afraid trailing slashes have a special meaning on Windows. Even on Linux/Darwin, I think they are still meaningful (they indicate that the path points to a directory).

Michael_Ilseman · March 7, 2021, 4:33pm

Windows roots are normalized differently than the rest of the path. Partial roots will even have backslashes added to them to complete them. So the following are examples of valid Windows roots, which can occur in isolation or preceding normal path components.

C:
C:\
\
\\server\share\
\\.\Volume\
\\?\UNC\server\share\

Since those are roots, they all keep their trailing backslash if they appear without any components after them (and System will actually add the backslash if it's missing). Beyond roots, do you have an example of special meaning?

Not to the operating system, AFAICT. If you have a counterexample, please share it. The trailing slash does not carry the meaning that the path is a directory, counter to widespread intuition, and that is precisely why System normalizes it away.

Higher-level tools and interfaces (e.g. shells, rsync) may layer extra semantics on top of what the operating system does prior to producing a path for the use with the operating system. Examples include environment variable substitution, tilde-expansion, and using trailing slash to designate different behavior (i.e. nest within a destination). These tools interpret (and sometimes modify) their input bytes prior to invoking calls in the OS (i.e. forming a proper path).

System's aim is to provide Swifter, stronger-typed interfaces for systems programming, and Swift is a different language than C. It's common in the C world to use different values within the same representation to carry semantic weight (e.g. 0 means an invalid pointer and -1 means the syscall failed). Swift uses Optional.nil instead of 0 for null pointers. System throws Errno on failure rather than returning -1. FileDescriptors, Errno, flags, etc., are separate strong types rather than just Int32.

Similarly, System doesn't overload FilePath's representation with a significant trailing slash that will be interpreted wildly inconsistently. A better model is to use a Bool or an enum to carry the semantic intent prior to producing a path to give the OS.

That is, just as we try to avoid this in Swift:

if value == -1 {
  // Different code paths and different interpretation of 
  // what `value` and other entities signify
}

we also avoid

if path.byte.last == "/" {
  // Different code path interpreting the semantic intent
  // of `path` differently.
}
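One possible shape for the alternative, sketched with hypothetical tool-level names (`CopyMode`, `parseDestination`, `target`). The "trailing slash means nest inside" rule is just an example of a convention a higher-level tool might choose:

```swift
// Hypothetical tool-level types; the names are illustrative only.
enum CopyMode {
    case intoDirectory   // nest the source inside the destination
    case asDestination   // the destination names the copy itself
}

// Interprets user input once, at the tool boundary, and hands back a path
// without trailing separators plus an explicit mode.
func parseDestination(_ input: String) -> (path: String, mode: CopyMode) {
    var path = input
    var mode = CopyMode.asDestination
    // Strip trailing separators but keep a lone root "/".
    while path.count > 1 && path.hasSuffix("/") {
        path.removeLast()
        mode = .intoDirectory
    }
    return (path, mode)
}

// Later code switches on the mode instead of re-inspecting the bytes.
func target(for source: String, destination: String) -> String {
    let (path, mode) = parseDestination(destination)
    switch mode {
    case .intoDirectory: return path + "/" + source
    case .asDestination: return path
    }
}

print(target(for: "notes.txt", destination: "/backup/"))  // /backup/notes.txt
print(target(for: "notes.txt", destination: "/backup"))   // /backup
```

The intent is decided in one place and carried in the type, so the path handed to the OS never needs a significant trailing slash.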

stevapple · March 8, 2021, 4:36am

E.g., according to Microsoft's documentation, Windows will respect trailing separators in order to handle special directory names (though this is not recommended).

I’m a bit confused (and curious) here. If the system doesn’t respect the trailing slash, why do nearly all calls with "/bin/sh/" fail with a "Not a directory" error?

BTW how would you deal with \\?\ paths? These paths will not be normalized in any system APIs (except for explicitly passing to a normalizing method).