I'm trying to figure out the encoding story for this, and... it's just awful. It isn't a functionality problem, because at the end of the day URLs can just percent-encode arbitrary bytes and accurately preserve the path, whatever its encoding, but there's a serious usability problem.
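To illustrate the "percent-encode arbitrary bytes" point, here is a minimal sketch using only the standard library. The helpers `percentEncode` and `percentDecode` are hypothetical names for this example, and the set of bytes left unescaped is an assumption chosen for simplicity; it is not the encode-set WebURL actually uses.

```swift
// Hypothetical helper: percent-encodes every byte outside a small ASCII set.
// The "safe" set below is an assumption for illustration, not WebURL's encode-set.
func percentEncode<Bytes: Collection>(_ bytes: Bytes) -> String where Bytes.Element == UInt8 {
  let hex = Array("0123456789ABCDEF".utf8)
  var out = [UInt8]()
  for byte in bytes {
    switch byte {
    case UInt8(ascii: "a")...UInt8(ascii: "z"),
         UInt8(ascii: "A")...UInt8(ascii: "Z"),
         UInt8(ascii: "0")...UInt8(ascii: "9"),
         UInt8(ascii: "-"), UInt8(ascii: "."), UInt8(ascii: "_"), UInt8(ascii: "/"):
      out.append(byte)
    default:
      out += [UInt8(ascii: "%"), hex[Int(byte >> 4)], hex[Int(byte & 0x0F)]]
    }
  }
  // `out` is pure ASCII at this point, so this conversion cannot lose anything.
  return String(decoding: out, as: UTF8.self)
}

// Hypothetical helper: reverses `percentEncode`, returning the raw bytes.
func percentDecode(_ string: String) -> [UInt8] {
  func hexValue(_ c: UInt8) -> UInt8? {
    switch c {
    case UInt8(ascii: "0")...UInt8(ascii: "9"): return c - UInt8(ascii: "0")
    case UInt8(ascii: "A")...UInt8(ascii: "F"): return c - UInt8(ascii: "A") + 10
    case UInt8(ascii: "a")...UInt8(ascii: "f"): return c - UInt8(ascii: "a") + 10
    default: return nil
    }
  }
  let input = Array(string.utf8)
  var out = [UInt8]()
  var i = 0
  while i < input.count {
    if input[i] == UInt8(ascii: "%"), i + 2 < input.count,
       let hi = hexValue(input[i + 1]), let lo = hexValue(input[i + 2]) {
      out.append((hi << 4) | lo)
      i += 3
    } else {
      out.append(input[i])
      i += 1
    }
  }
  return out
}

// "/caf" followed by the Latin-1 byte for "é" (0xE9), which is not valid UTF-8.
let rawPath: [UInt8] = [0x2F, 0x63, 0x61, 0x66, 0xE9]
let encodedPath = percentEncode(rawPath)
print(encodedPath)                            // prints "/caf%E9"
print(percentDecode(encodedPath) == rawPath)  // prints "true"
```

The bytes survive the round trip regardless of their encoding, which is why the problem described here is one of usability rather than functionality.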
My first draft of the API adds the following functions:
// Path2URL
extension WebURL {
  public init<S: StringProtocol>(
    filePath: S, style: FilePathStyle = .native
  ) throws
  public static func fromFilePathBytes<Bytes: Collection>(
    _ path: Bytes, style: FilePathStyle = .native
  ) throws -> WebURL where Bytes.Element == UInt8
}
// URL2Path
extension WebURL {
  public func filePath(style: FilePathStyle = .native) throws -> String
  public static func filePathBytes(
    from url: WebURL, style: FilePathStyle = .native
  ) throws -> ContiguousArray<UInt8>
}
But after trying to write documentation for these functions, I'm reluctantly coming around to the idea that the String versions just aren't going to work. There'd be a bunch of intricate caveats telling developers not to use them in this case, or that case, etc. (most of which are difficult, if not impossible, for developers to predict), and telling them to traffic in arrays of bytes instead.
I mean, this is a snippet from my current attempt to document WebURL.fromFilePathBytes. I can't imagine an average developer reading this and understanding what they're supposed to do. I don't think it's because it is poorly-worded (or at least, that isn't the only reason); there is just inherent complexity that is difficult to smooth over:
/// ## Encoding
///
/// This function accepts its path as a `Collection` of bytes, which allows certain paths to be expressed more precisely than `String` allows.
///
/// ### POSIX
///
/// POSIX-style paths are typically considered semi-arbitrary byte sequences; path components are delimited by the ASCII forward-slash (`0x2F`),
/// a component consisting of one or two ASCII periods (`0x2E`) is interpreted as a reference to the current or parent directory, respectively,
/// and the ASCII null byte (`0x00`) is often considered to be the end of the byte sequence - but otherwise, file and directory names are just opaque bytes.
/// In practice, file and directory names are often UTF-8, but they may not be, and so creating a `String` of a filesystem path may corrupt it by replacing certain
/// bytes with replacement characters (`�`).
///
/// Besides the reserved bytes listed above, this function does not assume that file or directory names have any particular encoding or interpretation.
/// Any bytes which would be interpreted by URL semantics are preserved by percent-encoding, so they may be decoded to their original values.
/// Note that the same considerations apply when converting the file URL back to a path - if the encoded bytes could not be losslessly
/// represented as a Swift `String` _before_ conversion to a URL, that will still be the case when performing the reverse transformation.
/// Use `WebURL.filePathBytes(from:style:)` to obtain the precise bytes of the path without Unicode replacements performed by `String`.
///
/// Note that on macOS and other Darwin platforms (iOS, iPadOS, tvOS, etc.), as well as ChromeOS, paths are guaranteed to be valid UTF-8.
/// The exact sequence of bytes may vary as the operating system performs Unicode normalization on file and directory names, meaning the percent-encoded
/// bytes in their URL representations may also differ, but their paths can always be losslessly represented by Swift's `String` type,
/// and `String`'s Unicode-aware comparison will ensure that these paths compare as equal to each other.
///
/// ### Windows
///
/// Windows paths (since Windows NT) are natively UTF-16-LE and are _not_ opaque byte sequences. The platform APIs expose them both as sequences
/// of 16-bit code-units (via the `-W` APIs) and, for legacy reasons, as sequences of bytes transcoded to the system's active code page (via the `-A` APIs).
/// These latter APIs are fundamentally lossy, as the active code page typically cannot represent every Unicode character, so users should take care
/// to use the `-W` APIs when interfacing with the Windows filesystem, unless the active code page is known to be a Unicode encoding such as UTF-8.
/// Well-formed UTF-16 can be converted to UTF-8 and losslessly round-tripped back to the same sequence of UTF-16 code-units, so this is the recommended
/// way to create a file URL from a Windows path.
///
/// However, it would be _far too easy_ if everything was well-formed UTF-16; so to keep things interesting, Windows also allows ill-formed UTF-16,
/// such as unpaired surrogate code-points, in file and directory names. These simply cannot be expressed in UTF-8, and so, unfortunate
/// as it may be, the expression of these code-points in an 8-bit encoding is left as an exercise for the reader. [WTF-8][WTF-8] may be a good choice,
/// but since it is a relaxed/intentionally broken version of UTF-8, users are discouraged from passing such URLs around to other applications,
/// which may not be able to decode them correctly.
///
/// As with POSIX paths, this function only interprets a small number of ASCII byte values - the forward- and back-slashes (`0x2F` and `0x5C`),
/// period (`0x2E`), space (`0x20`), and colon (`0x3A`) - and percent-encodes any other bytes which may be interpreted by URL semantics.
/// For UNC paths, the server name must be valid UTF-8 as it may be subject to IDNA normalization, which requires valid Unicode text.
(Also, lots of documentation that I've found seems to suggest that the only reserved bytes for POSIX paths are / and NUL. I don't think that's strictly true: I got quite worried about ASCII periods and went digging through the Linux kernel, and it turns out that the kernel itself interprets ASCII . and .. path components.)
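To make the lossiness described in that documentation concrete, here is a small sketch using only standard library `String` initializers. The byte and code-unit values are made-up examples, not taken from a real filesystem.

```swift
// A POSIX-style file name in Latin-1 rather than UTF-8:
// "caf" followed by the single byte 0xE9 ("é" in Latin-1).
let latin1Name: [UInt8] = [0x63, 0x61, 0x66, 0xE9]

// `String(decoding:as:)` repairs invalid UTF-8 by substituting U+FFFD,
// so the original byte is gone once we have a String.
let decodedName = String(decoding: latin1Name, as: UTF8.self)
print(decodedName == "caf\u{FFFD}")           // prints "true"
print(Array(decodedName.utf8) == latin1Name)  // prints "false"

// A Windows-style name containing an unpaired surrogate (0xD800).
let illFormedName: [UInt16] = [0x0061, 0xD800, 0x0062]
let decodedWide = String(decoding: illFormedName, as: UTF16.self)
print(Array(decodedWide.utf16) == illFormedName)  // prints "false"

// Well-formed UTF-16, by contrast, survives a trip through UTF-8 and back.
let wellFormedName = Array("naïve.txt".utf16)
let asUTF8 = Array(String(decoding: wellFormedName, as: UTF16.self).utf8)
let backToUTF16 = Array(String(decoding: asUTF8, as: UTF8.self).utf16)
print(backToUTF16 == wellFormedName)  // prints "true"
```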
If I cut the String initializer and String-returning function, I'm left with functions which traffic in arrays or collections of bytes, which is just a really poor API. It makes real-world usage incredibly awkward. To give one basic example: if developers want to pass the returned array to an imported C filesystem API, it needs to be null-terminated; but if they want to construct a String from it, they need to strip the trailing null or remember to use the String(cString:) initializer.
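Here is a rough sketch of that awkwardness. `pathBytes` is a stand-in for whatever a function like `filePathBytes(from:style:)` might return, and I'm assuming for the sake of the example that it carries no trailing null.

```swift
// Stand-in for bytes returned by a byte-oriented path API
// (assumed here to have no trailing null).
let pathBytes = ContiguousArray("/tmp/notes.txt".utf8)

// For a C API, the caller has to make a null-terminated copy.
var cPath = pathBytes
cPath.append(0)

// Building a String from the terminated copy with `String(decoding:as:)`
// keeps the terminator as a U+0000 scalar at the end of the string...
let withNull = String(decoding: cPath, as: UTF8.self)

// ...whereas `String(cString:)` stops at the first null byte.
let viaCString = cPath.withUnsafeBufferPointer { String(cString: $0.baseAddress!) }

// And the unterminated bytes need the other initializer again.
let direct = String(decoding: pathBytes, as: UTF8.self)

print(withNull == direct)                          // prints "false"
print(withNull.utf8.count - direct.utf8.count)     // prints "1"
print(viaCString == direct)                        // prints "true"
```

Every caller has to remember which of the two forms they are holding, and nothing in the type system helps them.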
Which leads me to think that the best way forward is to wrap the array, to capture that it is a "sort-of-string". This brings me to swift-system, which fortunately already contains just such a type: SystemString. Imagine if the APIs described above returned a SystemString rather than a String or array of bytes; immediately they would be so much better. Not only does it have a descriptive name, it could be a currency type which could be used in filesystem operations directly, if we could share this with swift-system.
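To show roughly what I mean by wrapping the array, here is a minimal sketch of a "sort-of-string" type. `PathString` and all of its members are hypothetical names invented for this example; it is not swift-system's `SystemString`, and it only handles 8-bit code units.

```swift
/// Hypothetical wrapper that owns path bytes and maintains a trailing null.
/// An illustration only; not swift-system's `SystemString`.
struct PathString: Equatable {
  // Invariant: the last element is 0, and no other element is 0.
  private var storage: ContiguousArray<UInt8>

  /// Fails if `bytes` contains a null, which cannot appear inside a path.
  init?<Bytes: Collection>(_ bytes: Bytes) where Bytes.Element == UInt8 {
    guard !bytes.contains(0) else { return nil }
    storage = ContiguousArray(bytes)
    storage.append(0)
  }

  /// The path's bytes, without the terminator.
  var bytes: ArraySlice<UInt8> { storage.dropLast() }

  /// Lossy view for display: invalid UTF-8 is replaced with U+FFFD.
  var lossyDescription: String { String(decoding: bytes, as: UTF8.self) }

  /// Calls `body` with a null-terminated pointer, as C APIs expect.
  func withCString<R>(_ body: (UnsafePointer<CChar>) throws -> R) rethrows -> R {
    try storage.withUnsafeBufferPointer { buffer in
      try buffer.withMemoryRebound(to: CChar.self) { try body($0.baseAddress!) }
    }
  }
}

let tmpPath = PathString("/tmp".utf8)!
// Round-trip through the C-string view, as an imported C function would see it.
let seenByC = tmpPath.withCString { String(cString: $0) }
print(seenByC)               // prints "/tmp"
print(tmpPath.bytes.count)   // prints "4"
```

The terminator becomes an implementation detail, and the lossy `String` conversion is something callers have to ask for by name rather than stumble into.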
Unfortunately, even if SystemString were a public API, adding a dependency on swift-system would be too high a burden: SwiftPM does not support optional dependencies/cross-import overlays, so this would have to be an unconditional dependency, even for users who don't care about file paths. And even if I could add it as an optional dependency, the roadmap for swift-system includes all sorts of platform APIs (e.g. processes and signals, sockets, pthreads, even ttys), and I can't limit the dependency to just this small piece of it.
Maybe it's completely out of the question, but - would it be possible to duplicate/move SystemString to a lower-level library that could be shared?
And if I could maybe push it a little bit further - what about splitting out FilePath? IIUC, it is purely lexical, so it doesn't require any actual platform APIs (although they could be added in swift-system). That would allow this library (and others) to return an even higher-level type, and enable software which can accurately create paths for remote systems.
@Michael_Ilseman what do you think?