Confused by URL.standardized

gutley · July 20, 2021, 8:43am

URL.standardized recently gave me a different result to what I was expecting for a URL with a baseURL.

E.g.:

let fromAbsolute = URL(string: "https://www.apple.com/iphone/compare/../../mac")!
print(fromAbsolute.standardized.absoluteString)
// Prints https://www.apple.com/mac

let base = URL(string: "https://www.apple.com/iphone/")!
let fromRelative = URL(string: "compare/../../mac", relativeTo: base)!
print(fromRelative.standardized.absoluteString)
// Prints https://www.apple.com/iphone/mac

The output from fromAbsolute is what I'd expect for both of them (i.e. go two parents up from 'compare' and then down to 'mac'), but maybe I'm misunderstanding how standardising works for relative URLs?

If I do this:

print(fromRelative.absoluteString)

(i.e. without the standardized), then that 'correctly' outputs https://www.apple.com/mac.

Weird bug or just my own misunderstanding?

Karl · July 20, 2021, 2:29pm

Yeah, I think it’s a bug (or if it is intentional, the documentation for standardized isn’t a good description of what this operation does). Also see SR-14145.

What seems to be happening is that standardized will resolve .. components from its relative reference in isolation, without considering the base URL. So in your fromRelative example, the relative reference goes from compare/../../mac to just mac, and applying that to the base URL gives https://www.apple.com/iphone/mac.
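
If that is what's going on, one way to sidestep it is to resolve against the base first and only standardize afterwards. A minimal sketch, assuming the behaviour described above and using only Foundation's absoluteURL and standardized properties (resolvedFirst is just a name for this example):

```swift
import Foundation

let base = URL(string: "https://www.apple.com/iphone/")!
let fromRelative = URL(string: "compare/../../mac", relativeTo: base)!

// Resolve the relative reference against its base first, so the ".."
// components are interpreted with the full path available, and only
// then standardize the resulting absolute URL.
let resolvedFirst = fromRelative.absoluteURL.standardized
print(resolvedFirst.absoluteString)
// https://www.apple.com/mac (same as fromRelative.absoluteString)
```

Since absoluteURL has no base, there is no separate relative part left for standardized to treat in isolation.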

idrougge · July 23, 2021, 11:31pm

Is .. as »parent directory« specified in any URL specification?

Karl · July 24, 2021, 7:41pm

Yes, basically all of them.

Historical notes

It's mentioned in RFC-1630, which basically just outlined the concept of a URL but didn't really standardise them. Back then, the format was just <scheme> ":" <path>, and the idea was that every scheme would define its own path format. Still, the / and dot components were reserved, which was necessary to have relative references (called "partial URIs" back then; basically, the string which goes inside an HTML <a> tag and lets you refer to pages relative to the current page's location):

The slash ("/", ASCII 2F hex) character is reserved for the
delimiting of substrings whose relationship is hierarchical. This
enables partial forms of the URI. Substrings consisting of single
or double dots ("." or "..") are similarly reserved.

In the context of URI: magic://a/b/c//d/e/f
the partial URIs would expand as follows:

../g -> magic://a/b/c//d/g

RFC-1738 walked it back a little bit (perhaps unintentionally), by just saying that the path depends on the scheme, but that many schemes (http, ftp, file) can be considered hierarchical and split by /s. It doesn't explicitly mention . or .. components.

A couple of years later, RFC-2396 came along with a definition for a generic syntax, because having each scheme do its own thing isn't all that helpful. It tightened up a lot of language, specified a lot of operations better, and defined the .. components in the context of relative references:

Within a relative-path reference, the complete path segments "." and
".." have special meanings: "the current hierarchy level" and "the
level above this hierarchy level", respectively. Although this is
very similar to their use within Unix-based filesystems to indicate
directory levels, these path components are only considered special
when resolving a relative-path reference to its absolute form

Many years after that, we had RFC-3986, which is the latest standard to go through the IETF. Again, the language is a lot better, more specific, and it doesn't just mention ".." components - it describes the algorithm to resolve them and includes many demonstrations.

The path segments "." and "..", also known as dot-segments, are defined for relative reference within the path name hierarchy. They are intended for use at the beginning of a relative-path reference to indicate relative position within the hierarchical tree of names. This is similar to their role within some operating systems' file directory structures to indicate the current directory and parent directory, respectively. However, unlike in a file system, these dot-segments are only interpreted within the URI path hierarchy and are removed as part of the resolution process

It's also part of the WHATWG URL Standard, where even percent-encoded . and .. components are interpreted, and the resolution algorithm includes some Windows drive letter quirks on all platforms when parsing file: URLs.
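
For reference, the dot-segment removal step in RFC-3986 (section 5.2.4) is small enough to sketch. The following is an illustrative transcription of those steps, not Foundation's or WebURL's implementation. removeDotSegments is a name made up for this example, it expects a path that has already been merged with the base path, and it ignores the WHATWG extras mentioned above (percent-encoded dots, drive letters):

```swift
// Illustrative sketch of RFC-3986 section 5.2.4 ("Remove Dot Segments").
// `removeDotSegments` is a hypothetical helper, not a Foundation API.
func removeDotSegments(_ path: String) -> String {
    var input = Substring(path)
    var output = ""

    // Removes the last segment of the output, including its leading "/".
    func dropLastSegment() {
        if let slash = output.lastIndex(of: "/") {
            output.removeSubrange(slash...)
        } else {
            output.removeAll()
        }
    }

    while !input.isEmpty {
        if input.hasPrefix("../") {             // step A
            input.removeFirst(3)
        } else if input.hasPrefix("./") {       // step A
            input.removeFirst(2)
        } else if input.hasPrefix("/./") {      // step B: "/./x" becomes "/x"
            input.removeFirst(2)
        } else if input == "/." {               // step B
            input = "/"
        } else if input.hasPrefix("/../") {     // step C: "/../x" becomes "/x"
            input.removeFirst(3)
            dropLastSegment()
        } else if input == "/.." {              // step C
            input = "/"
            dropLastSegment()
        } else if input == "." || input == ".." {   // step D
            input = ""
        } else {                                // step E: move one segment over
            let afterFirst = input.index(after: input.startIndex)
            let end = input[afterFirst...].firstIndex(of: "/") ?? input.endIndex
            output.append(contentsOf: input[..<end])
            input = input[end...]
        }
    }
    return output
}

print(removeDotSegments("/iphone/compare/../../mac")) // /mac
print(removeDotSegments("/a/b/c/./../../g"))          // /a/g
```

The first call is the merged path from the example at the top of the thread; the second is one of the examples given in the RFC.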

That being said, no URL specification defines what a particular library's APIs should do. Foundation invented the .standardized property, and it is their responsibility to document what it does. Currently, that documentation is quite poor - it is literally a single line:

A version of the URL with any instances of “..” or “.” removed from its path.

Even this is ambiguous - are those components removed (i.e. deleted), or is the path resolved? If the path is resolved, .. components may also remove other components.

Looking at the implementation in corelibs-foundation: as I suspected, it only considers the relative reference. In Foundation's model, the URL is actually the result of lazily resolving the relative part against the base URL, even though they are stored separately. A function like standardized should not change behaviour based on whether the URL is a relative reference on top of a base URL, or just an absolute URL with no base.
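
The separate storage is easy to see from the outside. A small sketch using the same URLs as the original post, and only the relativeString, baseURL and absoluteString properties:

```swift
import Foundation

let base = URL(string: "https://www.apple.com/iphone/")!
let url = URL(string: "compare/../../mac", relativeTo: base)!

// The relative part and the base are stored separately...
print(url.relativeString)                    // compare/../../mac
print(url.baseURL?.absoluteString ?? "nil")  // https://www.apple.com/iphone/

// ...and the absolute form comes from resolving one against the other.
print(url.absoluteString)                    // https://www.apple.com/mac
```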

These kinds of issues come up relatively often, and are why WebURL uses an entirely different object model. All WebURLs are absolute, and the result of resolving a relative reference against one is another absolute URL. It has no need for properties such as absoluteURL, which change the internal representation, or standardized. If you turn the URL into a string and re-parse it, the result is exactly the same URL - the same string, interpreted in exactly the same way, with exactly the same behaviour from all APIs.

let url = WebURL("https://www.apple.com/iphone/")!.resolve("compare/../../mac")!
print(url) // https://www.apple.com/mac