Foundation URL Improvements

Saklad5 · December 20, 2021, 3:20am

In practice, you’re probably right. That being said, I think the additional context of the URL initializer is useful here.

wowbagger · December 20, 2021, 5:08am

Adding to this, another example is that SwiftPM's Version is ExpressibleByStringLiteral, even though not all strings are valid versions.

Saklad5 · December 20, 2021, 6:44am

Yeah, I always use Version.init(_:_:_:prereleaseIdentifiers:buildMetadataIdentifiers:) for the compile-time check.

wowbagger · December 20, 2021, 7:17am

a reply not very relevant to the pitch

There doesn’t seem to be any obvious or significant benefit in using Version.init(_:_:_:prereleaseIdentifiers:buildMetadataIdentifiers:) over string literals, if only for the sake of compile-time checking. Because for most projects, SwiftPM’s runtime should only be part of your project’s compile time. Unless you use Version outside of Package.swift, you should be able to catch version string errors during compilation, regardless of which initializer you use.

icharleshu · December 21, 2021, 1:00am

Thanks for the suggestions!

icharleshu · December 21, 2021, 1:06am

Sorry I just realized I didn't make this clear: we are definitely not shutting down the possibility of using FilePath in the future. For now (in the short-term) we want to focus on improving URL itself since it's very popular and a core component of Foundation. We are evaluating different approaches, such as using FilePath for File IO tasks, for the long run.

Karl · December 21, 2021, 12:17pm

Also, I realise that bug fixes don't really come within the realm of "improvements", but I have filed 9 bugs for Foundation URL recently, and I don't think anybody from the Foundation team has seen them. (Update: 10! - SR-15632).

Some of them seem like they will need quite a bit of work. For example, URL.standardized seems to be rather naive - it is actually able to turn paths in to hostnames (!!!), doesn't properly account for relative URLs, and sometimes removes too many slashes. Some of the tests in Foundation's test suite are also wrong - they do not align with modern standards, or even other implementations of the same standard.

Also, if the path contains percent-encoded bytes which are not UTF-8, URL (and URLComponents, for that matter) just return empty strings. Which is very not cool.

let urlA = URL(string: "s://s/somepath/%72")!
print(urlA.path) // "/somepath/r"

let urlB = URL(string: "s://s/somepath/%82")!
print(urlB.path) // ""

URLComponents will also automatically percent-decode components if they include a ";":

//                                   V
URLComponents(url: URL(string: "file:;%2Fabc%2Fdef%2F")!, resolvingAgainstBaseURL: true)?.percentEncodedPath
// "%3B/abc/def/"

URLComponents(url: URL(string: "file:%2Fabc%2Fdef%2F")!, resolvingAgainstBaseURL: true)?.percentEncodedPath
// "%2Fabc%2Fdef%2F"

These combine to mean that, for some URLs, it is literally impossible to get the raw, percent-encoded path:

let url = URL(string:"s:/;foo%82")!
let comps = URLComponents(url: url, resolvingAgainstBaseURL: true)!

print(url) // "s:/;foo%82"
print(url.path) // "" - this is SR-15508

print(comps.string) // "s:" - the URLComponents initializer does it, too!
print(comps.path) // ""
print(comps.percentEncodedPath) // ""

Getting the raw path is crucial if you're, say, performing a network request. That's why I suggested giving easy access to the percent-encoded components directly from URL (without going through URLComponents, as that involves more allocations and potentially alters the path).

But I think is important to point out that, while API improvements are nice, the implementation really needs some work because it just isn't suitable for these kinds of tasks as it stands today. Those are more meaningful improvements that I would like to see in URL.

Why do these bugs happen?

A lot of these bugs can be traced back to the decision to automatically percent-decode URL components. It's not obvious, and apparently, even the Foundation team forget that it happens. For example, the semicolon bug looks to originate here. CFURLComponentsCopyPath removes percent-encoding:

github.com

apple/swift-corelibs-foundation/blob/efedfc30e7ea2ce5e4db85e3869b9cb149273ad9/CoreFoundation/URL.subproj/CFURLComponents.c#L164


      
              result = (CFURLComponentsRef)_CFRuntimeCreateInstance(alloc, _CFURLComponentsGetTypeID(), size, NULL);
              if ( result) {
                  // copy the _URIParseInfo into the result
                  memcpy(&result->_parseInfo, &parseInfo, sizeof(parseInfo));
                  
                  result->_lock = CFLockInit;
                  
                  result->_urlString = CFStringCreateCopy(alloc, string);
                  
                  // if there's a semi-colon in the path (what used to delimit the deprecated param component)
                  if (result->_parseInfo.semicolonInPathExists) {
                      // this will percent-encode it
                      CFStringRef path = _CFURLComponentsCopyPath(result);
                      _CFURLComponentsSetPath(result, path);
                      if ( path ) {
                          CFRelease(path);
                      }
                  }
              }
          }
          return ( result );

And another bug (where URL.pathComponents returns the wrong path components) also seems to originate from the same sort mistake. NSURL.path also removes percent-encoding:

github.com

apple/swift-corelibs-foundation/blob/efedfc30e7ea2ce5e4db85e3869b9cb149273ad9/Sources/Foundation/NSURL.swift#L41


      
                  return path._nsObject.standardizingPath
              }
          #if os(Windows)
              return path.unixPath
          #else
              return path
          #endif
          }
          
          
internal func _pathComponents(_ path: String?) -> [String]? {
              guard let p = path else {
                  return nil
              }
          
          
    var result = [String]()
              if p.length == 0 {
                  return result
              } else {
                  let characterView = p
                  var curPos = characterView.startIndex
                  let endPos = characterView.endIndex

So... you know, it may seem obscure to ask for something like percentEncodedPath directly on URL. Like, who cares about that, right? But really, these things are seriously important. When you need them, you need them. And currently URL is falling short when it comes to providing them.

I don't blame the team maintaining URL for making these sorts of mistakes -- it is very subtle and hard to remember, but it does pose the question of how we expect less-expert users to do the right thing?

This is why almost every other URL library returns percent-encoded components and does not perform decoding automatically (AFAIK, only Go and Foundation do). That would be too much of a breaking change for Foundation, but it should be critically important (IMO) to at least provide that data easily, and not make users fumble around going from URL to URLComponents.

So yeah, I just thought that could do with a bit of elaboration. In hindsight it might not have been clear why I asked for those things.

dhoepfl · December 21, 2021, 1:31pm

Throwing in SR-13846 as another bug report for the list.

CharlesS · December 21, 2021, 1:51pm

To be fair, the file system call is only to determine whether the URL is a directory or not, so it knows whether or not to use the trailing / character in the URL path. If you use .appendingPathComponent(_:isDirectory:) instead, you avoid the file system call.

Of course, that is even more painful to write than the already obnoxious .appendingPathComponent.

tera · December 21, 2021, 2:06pm

I would gladly call the "isDirectory:" variant if I knew that the other variant makes a blocking File system call (e.g. if the other was deprecated).

How about this naming:

url.appendingDirectoryName("foo").appendingFileName("bar")

or even smth like this, where the trailing "/" is not taken literally but converted in to the corresponding "isDirectory: true" parameter:

url + "foo/" + "bar"

and even this can be supported:

url + "foo/bar/baz" // Edit: see the AC/DC case discussed in a message down below

and internally translated into:

url
    .appendingPathComponent("foo", isDirectory: true)
    .appendingPathComponent("bar", isDirectory: true)
    .appendingPathComponent("baz", isDirectory: false)

tclementdev · December 21, 2021, 2:17pm

How do you know for sure that the item is a file or directory? What if it is later replaced on disk before you use the URL? What if you are building a path URL that doesn't yet exist on disk? It seems to me like the variant that should go away is the one asking what kind of item is the path component (and to be clear, no filesystem access should be done at this time).

CharlesS · December 21, 2021, 2:23pm

On the proposal itself—I love it. I think I asked for that init(filePath:) initializer back in the early days... it will certainly make working with the file system less tedious. A few questions only:

Is it feasible to make the new StaticString-based initializer use URLComponents to construct its URLs, so that it can conform to RFC-3986, which is at least from this century, instead of the ancient RFC-1738?
Why does the FilePath-based initializer return an optional, anyway, when the String-based initializer doesn't? I know this was pre-existing behavior, but the documentation doesn't have anything to say on the matter, and having the two versions looking so otherwise similar to each other causes the contrast to stick out.

Karl · December 21, 2021, 6:27pm

I don't know about the difference between those initializers, but it is quite right that not all file paths can be turned in to URLs. I spent quite a long time researching this - looking at implementations in Foundation, Chrome, Safari, Firefox, rust-url, and others, and built a comprehensive test suite. For reference, here are the failure conditions I came up with for WebURL: Documentation

Windows long paths are one easy case: they don't just reference files, but kernel objects resolved by something called the "object manager" (stuff like \\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy1). They can contain literal forward slashes, "." and ".." components, and even NULL bytes. Of course, you can percent-encode these things (well, some of them), but anybody who receives such a URL probably isn't equipped to handle it. Unfortunately, they are the recommended way to bypass Windows' 260-character path limit.

These things are just the worst. You can actually create files and folders which are inexpressible by standard Windows paths, so they won't show up on the Desktop, Explorer can't open them, can't delete them, and almost no other applications can either. Pics for proof. They're not supposed to be normalized at all, but Windows' own APIs are inconsistent, so you can never really tell where they're going to go. I pity users who happen to find such files or folders on their hard drives. The safest option is to just fail if you see anything out of the ordinary.

Windows UNC paths can also contain hostnames, which may not be valid.

Relative paths are another thing that should probably be rejected. To resolve them, you need to look at the current working directory, which as discussed above, is not thread-safe. Imagine if you have one thread processing files, and it changes the working directory, and suddenly all your file URLs point to a different location.

This could be a careless accident by some dependency of a dependency, or part of a sophisticated attack on your server (if you think this won't happen, I can show you some impressive CVEs for path traversal attacks), and is another example of how Foundation's URL processing depends on external hidden state.

I don't know if it can be changed at this point, but if there are options to tighten up the (still relatively new) FilePath initializer, we should take them. It should be allowed to fail.

Saklad5 · December 21, 2021, 7:28pm

I think this is a good time to remind people that Optional.unsafelyUnwrapped exists, and people can use it for things they know have a value without incurring any performance penalty in production code.

If it makes sense for an initializer to be fallible, it should be. If you are certain it won’t fail, you can use that. There’s no reason to implement infallible initializers that can’t actually accept all input.

Karl · December 21, 2021, 7:33pm

Well, in this case you're allocating an object (a URL), so I wouldn't worry about the performance impact of unwrapping the optional. It's certainly not worth the risk of UB.

Saklad5 · December 21, 2021, 7:37pm

unsafelyUnwrapped is basically the same as !, but using assert instead of fatalError. Regardless of actual performance impact, it’s a useful distinction for understanding expected code paths.

icharleshu · December 21, 2021, 11:09pm

Could you elaborate on what issues you are seeing? URL(string: "https://example.com/portal/notices;columnsId=123;pageNo=1") does return an URL.

icharleshu · December 22, 2021, 12:14am

Unfortunately, I think this is the case of "the ship has already sailed." URL already "tangles" with the file system in many ways. For example:

.resourceValues(forKeys:) and friends need to talk to the file system to retrieve metadata.
As others have pointed out, init(fileURLWithPath:) talks to the file system to determine whether the path is a directory.

Both cases are very popular and "high profile" use cases of URL. We won't be able to move away from them soon.

Ironically, the "common directories" case proposed here doesn't depend on the file system because we are only returning predefined paths. It does not create directories for you if they don't exist. The currentDirectory case that Karl pointed out is an exception and I'm working on possible solutions for it.

icharleshu · December 22, 2021, 12:15am

Thanks for the suggestion!

icharleshu · December 22, 2021, 12:30am

Thanks for pointing this out. I didn't realize putting "use FilePath" under Alternatives Considered unintentionally implies that we won't consider using FilePath at all. I want to clarify that this is not true -- we are still evaluating the possibilities of using FilePath in Foundation in the future. This proposal mainly focuses on the "quick changes" that we can make to URL. IMO most (if not all) of these changes will improve URL itself regardless of whether we switch to FilePath or not in the future.

Thanks for the suggestion! I will look into this.