API changes for 0.2.0

Karl · August 31, 2021, 5:48pm

Hi!

I'm getting ready to release version 0.2.0 of WebURL. It's going to be an exciting and important release, including a WebURLSystemExtras module for integration with System.framework and the swift-system package, and aligning the project with the very latest revision of the URL Standard.

It's also an important step on the journey to a 1.0 release with a stable API, and as part of that I'd like to discuss some changes I've made or am considering. I'd really like for this to be community-driven, with an API that we (the Swift community) are happy and comfortable with, and that we want to use in our projects. So scrutiny over the API, implementation, etc. is very welcome.

The API reference can be found here: https://karwa.github.io/swift-url/. It's generated from the main branch, so it includes some of the changes listed below. I still need to figure out how to publish both the main and latest-release documentation, as GitHub Pages only allows publishing a single branch.

Changes made

LazilyPercentDecodedUTF8 now conforms to BidirectionalCollection when its source collection does.
IPv4Address.parse(utf8:) and corresponding ParserResult type are deprecated. They will be included in 0.2.0, but removed in some later release. This API allowed you to distinguish invalid IPv4 addresses from ill-formatted addresses. For instance, "9999999999" would return .failure due to overflow, but "9999999999.com" would return .notAnIPAddress due to the ".com".

Changes to the way hostnames are parsed in the URL standard made it impractical to keep supporting this feature - and besides, its functionality is probably too limited for the use-cases it was intended to serve. Instead, I'm going to explore adding a general hostname parser with support for IPv4, v6, and domain names in a later release.
cannotBeABase is deprecated, and replaced with isHierarchical. cannotBeABase will still be present in 0.2.0, but will be removed in a later release. The previous name was taken directly from the URL Standard, but I found it too opaque and difficult to explain in documentation. Cannot-be-a-base/non-hierarchical URLs (e.g. javascript:alert("hello") or mailto:bob@example.com) only support a very limited subset of features, and I found myself constantly using the phrase "non-hierarchical" to describe them anyway.

This is a change from an opaque, capabilities-based property to a property based on the URL's syntax and structure (I proposed to include it in the standard, but it was a bit controversial for that reason. Discussions are ongoing). Nevertheless, in the context of this library, I think it is much easier to understand. It's important to try to make the URL's capabilities predictable, and if future versions of the standard make that difficult, I'll do my best to handle that at the library level.

Changes being considered

Stop automatically percent-decoding path components

So right now we have this .pathComponents view (Documentation), which is a collection interface over the URL's path. The elements of this collection are Strings, and are automatically percent-decoded from the raw contents:
```
import WebURL

let url = WebURL("http://example.com/swift/packages/%F0%9F%A6%86%20tracker")!
assert(url.pathComponents.first == "swift")
assert(url.pathComponents.last == "🦆 tracker")

// A second constant with its own name, since `url` is already declared above.
let fileURL = WebURL("file:///C:/Windows/🦆/System 32/somefile.dll")!
assert(fileURL.serialized == "file:///C:/Windows/%F0%9F%A6%86/System%2032/somefile.dll")
assert(fileURL.pathComponents.elementsEqual([
  "C:", "Windows", "🦆", "System 32", "somefile.dll"
]))

// You can still access the raw components via the .utf8 view using an index.
// We could add an easier API for this.
let duckDirectory = fileURL.pathComponents.dropFirst(2).startIndex
assert(String(decoding: fileURL.utf8.pathComponent(duckDirectory), as: UTF8.self) == "%F0%9F%A6%86")
```
This is incredibly convenient, and when it works, it feels really nice and simple. But there are 2 issues: firstly, the contents may not be valid UTF-8. They often will be, but there's no law that they have to be, and in that case you get mojibake. URLs are wild - you see all kinds of weird legacy stuff floating about.
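To make the first issue concrete, here is a minimal sketch using only the standard library. The `percentDecodedBytes` helper is hypothetical and written just for this illustration (it is not WebURL's implementation). It shows what happens when percent-decoded bytes that happen to be Latin-1 are interpreted as UTF-8.

```swift
// Hypothetical helper (not part of WebURL): turns "%XX" escapes into raw
// bytes and copies every other byte through unchanged.
func percentDecodedBytes(_ input: String) -> [UInt8] {
  let source = Array(input.utf8)
  var result: [UInt8] = []
  var i = 0
  while i < source.count {
    if source[i] == UInt8(ascii: "%"), i + 2 < source.count,
       let high = Character(Unicode.Scalar(source[i + 1])).hexDigitValue,
       let low = Character(Unicode.Scalar(source[i + 2])).hexDigitValue {
      result.append(UInt8(high * 16 + low))
      i += 3
    } else {
      result.append(source[i])
      i += 1
    }
  }
  return result
}

// Percent-encoded UTF-8 decodes cleanly.
let utf8Bytes = percentDecodedBytes("%F0%9F%A6%86%20tracker")
let duck = String(decoding: utf8Bytes, as: UTF8.self)     // "🦆 tracker"

// "café" encoded as Latin-1: the byte 0xE9 is not valid UTF-8 on its own,
// so it is silently replaced with U+FFFD.
let latin1Bytes = percentDecodedBytes("caf%E9")
let cafe = String(decoding: latin1Bytes, as: UTF8.self)   // "caf\u{FFFD}"
```

The second result is a perfectly valid String, so nothing tells the caller that the original bytes were lost along the way.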

The second issue is that, philosophically, I'm quite averse to automatically percent-decoding anything. Over-decoding is a well-known problem that has led to security vulnerabilities in the past. I think this is actually a flaw in Foundation's APIs, and one that we should learn from. For example, the documentation for removingPercentEncoding says:

Important:
You must call this method only on strings that you know to be percent-encoded. Calling this method on strings that are not percent-encoded can lead to misinterpreting a percent character as the beginning of a percent-encoded sequence.

But then, that decision is taken out of your hands somewhat because URL automatically percent-decodes its components (which the documentation doesn't mention, but given the above notice is clearly important information):
```
import Foundation
let url = URL(string: "file:///C:/Windows/%F0%9F%A6%86/System%2032/somefile.dll")!
print(url.path) // "/C:/Windows/🦆/System 32/somefile.dll"
```
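As a small illustration of why over-decoding matters, the sketch below decodes the same string twice with Foundation's `removingPercentEncoding`. The file name is made up for the example; the point is only that a second decode changes the meaning of a value that was already decoded.

```swift
import Foundation

// A made-up file name, "report%41.txt", percent-encoded exactly once.
let encodedOnce = "report%2541.txt"

// Decoding once recovers the intended name.
let decodedOnce = encodedOnce.removingPercentEncoding       // "report%41.txt"

// Decoding again treats the literal "%41" as an escape sequence.
let decodedTwice = decodedOnce?.removingPercentEncoding     // "reportA.txt"
```

If a component has already been decoded implicitly and the caller does not know that, the second step is an easy mistake to make.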
Generally, WebURL does not automatically percent-decode anything - the .path, .query, .fragment properties are all just raw slices of the URL string. But then we break that principle for the pathComponents view for the sake of everything looking so nice in the common case.

I'm genuinely not sure about the right way to proceed here. I'm particularly wary of the security implications of implicit percent-decoding. Perhaps it is better to just keep things as simple as possible, and not decode - after all, you could always use url.pathComponents.lazy.map { $0.percentDecoded } to get the same result. And consistency with the rest of the API is very important... but there is a pretty large hit to usability. Without implicit decoding, the earlier example would look like this:
```
import WebURL

let url = WebURL("http://example.com/swift/packages/%F0%9F%A6%86%20tracker")!
assert(url.pathComponents.first == "swift")
assert(url.pathComponents.last == "%F0%9F%A6%86%20tracker") // ewww

// A second constant with its own name, since `url` is already declared above.
let fileURL = WebURL("file:///C:/Windows/🦆/System 32/somefile.dll")!
assert(fileURL.serialized == "file:///C:/Windows/%F0%9F%A6%86/System%2032/somefile.dll")
assert(fileURL.pathComponents.elementsEqual([
  "C:", "Windows", "%F0%9F%A6%86", "System%2032", "somefile.dll"
]))
```
Similar concerns apply to the form parameters view (Documentation).

Rename LazilyPercentEncodedUTF8/LazilyPercentDecodedUTF8 to LazilyPercentEncodedBytes and LazilyPercentDecodedBytes.

They don't actually validate UTF-8 or transcode anything to UTF-8. They encode strings of raw bytes into ASCII, and decode ASCII to raw bytes. You do occasionally see things like percent-encoded Latin-1, and having "bytes" in the name emphasises that the decoded content is not validated in any way and should be treated with caution.
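The encoding direction can be sketched in a few lines. The `percentEncoded` function below is a hypothetical stand-in, not the library's type, and it assumes a deliberately conservative encode set (only ASCII letters and digits pass through). It operates on bytes and never checks whether they form valid UTF-8.

```swift
// Hypothetical helper: percent-encodes arbitrary bytes into an ASCII string.
// Assumption for this sketch: only ASCII letters and digits are left as-is.
func percentEncoded(_ bytes: [UInt8]) -> String {
  let hexDigits = Array("0123456789ABCDEF".utf8)
  var output: [UInt8] = []
  for byte in bytes {
    switch byte {
    case 0x30...0x39, 0x41...0x5A, 0x61...0x7A:
      // ASCII digit or letter: copy through unchanged.
      output.append(byte)
    default:
      // Anything else becomes "%" followed by two uppercase hex digits.
      output.append(UInt8(ascii: "%"))
      output.append(hexDigits[Int(byte >> 4)])
      output.append(hexDigits[Int(byte & 0x0F)])
    }
  }
  return String(decoding: output, as: UTF8.self)
}

// Arbitrary bytes are fine, including ones that are not valid UTF-8.
let rawEncoded = percentEncoded([0xFF, 0x41, 0x20])          // "%FFA%20"

// The same character encodes differently depending on the source bytes.
let utf8Encoded = percentEncoded(Array("\u{E9}".utf8))       // "%C3%A9"
let latin1Encoded = percentEncoded([0xE9])                   // "%E9"
```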

That's it?

I think that's it. Generally I think the API is pretty good, but I'd love to know what you think about these changes or any other aspect of the API.

There are a bunch of things that I'd like to add in the future (e.g. access to the host parser, setting the host property using a Host object, data URLs and MIME types, etc). There's some pretty exciting stuff happening in the web standards community, too - including a URLPattern standard for Express-like routing, all built on the WHATWG URL Standard which this project aligns with.

Currently the focus is on stabilising what is there. With your help and input, I think we can make this a really great library and elevate lots of parts of the Swift ecosystem. Thanks for reading!

Karl · October 28, 2021, 2:43pm

Some updates. I'm sure these are issues that people have encountered/struggled with in the past, so do feel free to leave feedback.

On replacing cannotBeABase with isHierarchical: the proposal has now been incorporated in the URL Standard. Cannot-be-a-base URLs are now referred to as having an "opaque path". This makes it easier to explain these URLs in documentation and why they have certain limitations.

For example, javascript:alert("hello") is a URL with an opaque path. The reason you cannot parse a relative reference (e.g. ../foo) against it is because its path cannot be split into a hierarchical list of components.

That's a lot more intuitive than saying "the parser decided this URL cannot be used as a base-URL", and it's better to find consensus and adjust the standard rather than inventing library-specific terminology.

On the proposal to stop automatically percent-decoding path components: I'm leaning towards not doing this (i.e. keeping the existing automatic percent-decoding for path components), for a couple of reasons:

Path components differ from URL-level components such as .path because they don't have URL-level structure which needs to be preserved. Automatically decoding the .path would be dangerous, because paths such as /foo/bar%2F..%2F would have their structure meaningfully altered by percent-decoding, but decoding the components after splitting the encoded path is safe.

(FWIW, even URL itself gets it wrong, and sometimes forgets that its own path property is percent-decoded; this is where the biggest security concerns are. See [SR-15363] "URL/NSURL.pathComponents splits the percent-decoded path", Issue #3195 in apple/swift-corelibs-foundation on GitHub.)
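The difference between the two orders of operation can be shown with plain strings. This is only a sketch using Foundation's `removingPercentEncoding` and `split`, not WebURL's parser, applied to the path from the example above.

```swift
import Foundation

let encodedPath = "/foo/bar%2F..%2F"

// Safe order: split the encoded path first, then decode each component.
// The escaped slashes stay inside a single component.
let splitThenDecode = encodedPath
  .split(separator: "/")
  .map { String($0).removingPercentEncoding ?? String($0) }
// ["foo", "bar/../"]

// Dangerous order: decode the whole path, then split.
// The escaped slashes become separators and a ".." component appears.
let decodeThenSplit = (encodedPath.removingPercentEncoding ?? encodedPath)
  .split(separator: "/")
  .map { String($0) }
// ["foo", "bar", ".."]
```

In the first result the path still has two components, as it did in the URL. In the second it has three, one of which would be read as "go up a directory" by anything that resolves dot segments.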

Having a blanket "no automatic decoding" rule across the entire API for the sake of consistency seems really attractive, but some APIs would become almost unusable as a result. Especially for form-encoded query parameters, which require a special form of percent-decoding that also turns "+" characters back into spaces:

```
var url = WebURL("https://example.com/currency/convert?amount=20&from=EUR&to=USD")!
url.formParams.to // "USD"

url.formParams.to = "US Dollars"
// url = "https://example.com/currency/convert?amount=20&from=EUR&to=US+Dollars"

url.formParams.to // Without decoding, this would return "US+Dollars"
url.formParams.to?.percentDecoded() // Incorrect. Still "US+Dollars"
url.formParams.to?.percentDecoded(substitutions: .formEncoding) // "US Dollars"
```
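For readers who want to see what form-decoding involves, here is a rough sketch built on Foundation rather than on WebURL. The `formDecoded` helper is hypothetical. The order of the two steps matters: "+" is turned into a space before percent-decoding, so that an escaped "%2B" still comes out as a literal plus sign.

```swift
import Foundation

// Hypothetical helper: form-decodes a single key or value that has already
// been split out of the query string.
func formDecoded(_ component: String) -> String? {
  // Step 1: "+" stands for a space in form-encoding.
  let spaced = component.replacingOccurrences(of: "+", with: " ")
  // Step 2: ordinary percent-decoding (nil if an escape is malformed).
  return spaced.removingPercentEncoding
}

let plainDecode = "US+Dollars".removingPercentEncoding   // "US+Dollars"
let formDecode = formDecoded("US+Dollars")               // "US Dollars"
let escapedPlus = formDecoded("1%2B1")                   // "1+1"
```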

So the new rule would be: main components (hostname, path, query, etc) are returned "raw", but their sub-components would be returned decoded. It's an attempt to both be accurate and convenient, but is it too complex? Percent-decoding an already-decoded path component or query parameter is still dangerous.

In my brief investigation of other URL libraries, only rust-url, .NET's Uri type, and Foundation provide path component APIs:

rust-url's path_segments iterator keeps percent-encoding.
.NET's Uri.Segments keeps percent-encoding.
Foundation's URL.pathComponents returns percent-decoded path components as an Array.

On the other hand, when it comes to form parameters:

rust-url's query_pairs returns form-decoded key-value pairs.
Java's URLDecoder returns the form-decoded value for a key (with optional text encoding parameter).
Go's net/url package returns a view of form-decoded values from its .Query() function.
Foundation's URLComponents.queryItems are percent-decoded, but not form-decoded.

As for non-UTF8 contents, they are relatively rare and difficult to work with in Swift anyway; it is still possible to get a raw, percent-encoded path component/query parameter, decode it to bytes, and interpret those bytes with the text encoding of your choice (or as binary data), but that's always going to be a bit of an involved process. It's probably worth optimising for the common case.
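The last step of that process, interpreting already-decoded bytes with a chosen text encoding, might look like the sketch below. It assumes the raw bytes have already been obtained by percent-decoding a component such as "M%FCnchen", and uses Foundation's `String(bytes:encoding:)`.

```swift
import Foundation

// Bytes assumed to come from percent-decoding "M%FCnchen":
// "München" in Latin-1, where 0xFC is "ü".
let decodedBytes: [UInt8] = [0x4D, 0xFC, 0x6E, 0x63, 0x68, 0x65, 0x6E]

// Strict UTF-8 interpretation fails, because 0xFC is not valid UTF-8.
let asUTF8 = String(bytes: decodedBytes, encoding: .utf8)          // nil

// Interpreting the same bytes as Latin-1 recovers the text.
let asLatin1 = String(bytes: decodedBytes, encoding: .isoLatin1)   // "München"
```

Unlike `String(decoding:as:)`, the failable initializer reports the mismatch instead of inserting replacement characters, which makes it a reasonable first check before falling back to another encoding.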