API changes for 0.2.0

Hi!

I'm getting ready to release version 0.2.0 of WebURL. It's going to be an exciting and important release, including a WebURLSystemExtras module for integration with System.framework and the swift-system package, and aligning the project with the very latest revision of the URL Standard.

It's also an important step on the journey to a 1.0 release with a stable API, and as part of that I'd like to discuss some changes I've made or am considering. I'd really like for this to be community-driven, with an API that we (the Swift community) are happy and comfortable with, and that we want to use in our projects. So scrutiny over the API, implementation, etc is very welcome.

The API reference can be found here: WebURL - WebURL. It's generated from the main branch, so it includes some of the changes listed below. I still need to figure out how to publish both the main and latest-release documentation, as GitHub Pages only allows publishing a single branch.

Changes made

  • LazilyPercentDecodedUTF8 now conforms to BidirectionalCollection when its source collection does.

  • IPv4Address.parse(utf8:) and corresponding ParserResult type are deprecated. They will be included in 0.2.0, but removed in some later release. This API allowed you to distinguish invalid IPv4 addresses from ill-formatted addresses. For instance, "9999999999" would return .failure due to overflow, but "9999999999.com" would return .notAnIPAddress due to the ".com".

    Changes to the way hostnames are parsed in the URL standard made it impractical to keep supporting this feature - and besides, its functionality is probably too limited for the use-cases it was intended to serve. Instead, I'm going to explore adding a general hostname parser with support for IPv4, v6, and domain names in a later release.

  • cannotBeABase is deprecated, and replaced with isHierarchical. cannotBeABase will still be present in 0.2.0, but will be removed in a later release. The previous name was taken directly from the URL Standard, but I found it too opaque and difficult to explain in documentation. Cannot-be-a-base/non-hierarchical URLs (e.g. javascript:alert("hello") or mailto:bob@example.com) only support a very limited subset of features, and I found myself constantly using the phrase "non-hierarchical" to describe them anyway.

    This is a change from an opaque, capabilities-based property to a property based on the URL's syntax and structure (I proposed to include it in the standard, but it was a bit controversial for that reason. Discussions are ongoing). Nevertheless, in the context of this library, I think it is much easier to understand. It's important to try to make the URL's capabilities predictable, and if future versions of the standard make that difficult, I'll do my best to handle that at the library level.
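To illustrate what the conditional BidirectionalCollection conformance means in practice, here's a rough, self-contained sketch. It is not WebURL's actual implementation: the names PercentDecodedBytes, triplet(at:) and hexValue are made up for this example. The idea is that walking backwards only needs to check whether the three bytes before an index form a "%XX" triplet, which is possible whenever the source itself can be traversed backwards.

```swift
/// Hypothetical helper: the value of an ASCII hex digit, or nil.
func hexValue(_ byte: UInt8) -> UInt8? {
  switch byte {
  case UInt8(ascii: "0")...UInt8(ascii: "9"): return byte - UInt8(ascii: "0")
  case UInt8(ascii: "A")...UInt8(ascii: "F"): return byte - UInt8(ascii: "A") + 10
  case UInt8(ascii: "a")...UInt8(ascii: "f"): return byte - UInt8(ascii: "a") + 10
  default: return nil
  }
}

/// Hypothetical lazy view which percent-decodes a collection of bytes on the fly.
struct PercentDecodedBytes<Source: Collection>: Collection where Source.Element == UInt8 {
  let source: Source

  struct Index: Comparable {
    let position: Source.Index
    static func == (lhs: Index, rhs: Index) -> Bool { lhs.position == rhs.position }
    static func < (lhs: Index, rhs: Index) -> Bool { lhs.position < rhs.position }
  }

  var startIndex: Index { Index(position: source.startIndex) }
  var endIndex: Index { Index(position: source.endIndex) }

  /// If a valid "%XX" triplet starts at `i`, returns its decoded byte and the index after it.
  private func triplet(at i: Source.Index) -> (byte: UInt8, end: Source.Index)? {
    guard i != source.endIndex, source[i] == UInt8(ascii: "%") else { return nil }
    let d1 = source.index(after: i)
    guard d1 != source.endIndex, let hi = hexValue(source[d1]) else { return nil }
    let d2 = source.index(after: d1)
    guard d2 != source.endIndex, let lo = hexValue(source[d2]) else { return nil }
    return (hi << 4 | lo, source.index(after: d2))
  }

  subscript(i: Index) -> UInt8 {
    triplet(at: i.position)?.byte ?? source[i.position]
  }

  func index(after i: Index) -> Index {
    Index(position: triplet(at: i.position)?.end ?? source.index(after: i.position))
  }
}

// Only available when the source collection supports backwards traversal.
extension PercentDecodedBytes: BidirectionalCollection where Source: BidirectionalCollection {
  func index(before i: Index) -> Index {
    // Step back over a whole "%XX" triplet if one ends exactly at `i`, otherwise over 1 byte.
    if let start = source.index(i.position, offsetBy: -3, limitedBy: source.startIndex),
       let t = triplet(at: start), t.end == i.position {
      return Index(position: start)
    }
    return Index(position: source.index(before: i.position))
  }
}

let decoded = PercentDecodedBytes(source: Array("a%20b%2".utf8))
print(Array(decoded))            // [97, 32, 98, 37, 50]
print(Array(decoded.reversed())) // [50, 37, 98, 32, 97]
```

Because Array is a BidirectionalCollection, the wrapper picks up reversed(), last, and friends for free; with a forward-only source it would remain a plain Collection.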

Changes being considered

  • Stop automatically percent-decoding path components

    So right now we have this .pathComponents view (Documentation), which is a collection interface over the URL's path. The elements of this collection are Strings, and automatically percent-decoded from the raw contents:

    let url = WebURL("http://example.com/swift/packages/%F0%9F%A6%86%20tracker")!
    assert(url.pathComponents.first == "swift")
    assert(url.pathComponents.last == "🦆 tracker")
    
    let fileURL = WebURL("file:///C:/Windows/🦆/System 32/somefile.dll")!
    assert(fileURL.serialized == "file:///C:/Windows/%F0%9F%A6%86/System%2032/somefile.dll")
    assert(fileURL.pathComponents.elementsEqual([
      "C:", "Windows", "🦆", "System 32", "somefile.dll"
    ]))
    
    // You can still access the raw components via the .utf8 view using an index.
    // We could add an easier API for this.
    let duckDirectory = fileURL.pathComponents.dropFirst(2).startIndex
    assert(String(decoding: fileURL.utf8.pathComponent(duckDirectory), as: UTF8.self) == "%F0%9F%A6%86")
    

    This is incredibly convenient, and when it works, it feels really nice and simple. But there are 2 issues: firstly, the contents may not be valid UTF-8. They often will be, but there's no law that they have to be, and in that case you get mojibake. URLs are wild - you see all kinds of weird legacy stuff floating about.

    The second issue is that, philosophically, I'm quite averse to automatically percent-decoding anything. Over-decoding is a well-known problem that has led to security vulnerabilities in the past. I think this is actually a flaw in Foundation's APIs, and one that we should learn from. For example, the documentation for removingPercentEncoding says:

    Important:
    You must call this method only on strings that you know to be percent-encoded. Calling this method on strings that are not percent-encoded can lead to misinterpreting a percent character as the beginning of a percent-encoded sequence.

    But then, that decision is taken out of your hands somewhat because URL automatically percent-decodes its components (which the documentation doesn't mention, but given the above notice is clearly important information):

    import Foundation
    let url = URL(string: "file:///C:/Windows/%F0%9F%A6%86/System%2032/somefile.dll")!
    print(url.path) // "/C:/Windows/🦆/System 32/somefile.dll"
    

    Generally, WebURL does not automatically percent-decode anything - the .path, .query, .fragment properties are all just raw slices of the URL string. But then we break that principle for the pathComponents view for the sake of everything looking so nice in the common case.

    I'm genuinely not sure about the right way to proceed here. I'm particularly wary of the security implications of implicit percent-decoding. Perhaps it is better to just keep things as simple as possible, and not decode - after all, you could always use url.pathComponents.lazy.map { $0.percentDecoded } to get the same result. And consistency with the rest of the API is very important... but there is a pretty large hit to usability. Without implicit decoding, the earlier example would look like this:

    let url = WebURL("http://example.com/swift/packages/%F0%9F%A6%86%20tracker")!
    assert(url.pathComponents.first == "swift")
    assert(url.pathComponents.last == "%F0%9F%A6%86%20tracker") // ewww
    
    let fileURL = WebURL("file:///C:/Windows/🦆/System 32/somefile.dll")!
    assert(fileURL.serialized == "file:///C:/Windows/%F0%9F%A6%86/System%2032/somefile.dll")
    assert(fileURL.pathComponents.elementsEqual([
      "C:", "Windows", "%F0%9F%A6%86", "System%2032", "somefile.dll"
    ]))
    

    Similar concerns apply to the form parameters view (Documentation).
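
    To make the over-decoding concern concrete, here's a small sketch using Foundation's removingPercentEncoding. The path string is made up for illustration. The value is meant to be decoded exactly once; if some layer has already decoded it implicitly and another layer decodes it again, harmless escaped text turns into a ".." path segment.

    ```swift
    import Foundation

    // Hypothetical input: a literal "%2e%2e" which was itself percent-encoded once.
    let stored = "files/%252e%252e/secret"

    // Decoding once gives back the text that was originally escaped.
    let once = stored.removingPercentEncoding ?? stored
    print(once)   // files/%2e%2e/secret

    // Decoding a second time (over-decoding) changes the meaning of the path.
    let twice = once.removingPercentEncoding ?? once
    print(twice)  // files/../secret
    ```

    If decoding is always an explicit step, it's at least visible in the code how many times it has happened.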

  • Rename LazilyPercentEncodedUTF8/LazilyPercentDecodedUTF8 to LazilyPercentEncodedBytes and LazilyPercentDecodedBytes.

    They don't actually validate UTF-8 or transcode anything to UTF-8. They encode strings of raw bytes into ASCII, and decode ASCII to raw bytes. You do occasionally see things like percent-encoded Latin-1, and having "bytes" in the name emphasises that the decoded content is not validated in any way and should be treated with caution.
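
As a rough illustration of why "bytes" may be the more honest name, here's a self-contained sketch. The percentDecodedBytes and hexValue functions are hypothetical helpers written for this example, not WebURL APIs. Decoding produces raw bytes, and it is the caller's interpretation of those bytes that decides whether you get text or mojibake.

```swift
/// Hypothetical helper: the value of an ASCII hex digit, or nil.
func hexValue(_ byte: UInt8) -> UInt8? {
  switch byte {
  case UInt8(ascii: "0")...UInt8(ascii: "9"): return byte - UInt8(ascii: "0")
  case UInt8(ascii: "A")...UInt8(ascii: "F"): return byte - UInt8(ascii: "A") + 10
  case UInt8(ascii: "a")...UInt8(ascii: "f"): return byte - UInt8(ascii: "a") + 10
  default: return nil
  }
}

/// Hypothetical helper: decodes "%XX" sequences to raw bytes, with no validation of the result.
func percentDecodedBytes(_ string: String) -> [UInt8] {
  let input = Array(string.utf8)
  var output: [UInt8] = []
  var i = 0
  while i < input.count {
    if input[i] == UInt8(ascii: "%"), i + 2 < input.count,
       let hi = hexValue(input[i + 1]), let lo = hexValue(input[i + 2]) {
      output.append(hi << 4 | lo)
      i += 3
    } else {
      output.append(input[i])
      i += 1
    }
  }
  return output
}

// "café" in Latin-1: the "é" is the single byte 0xE9, which is not valid UTF-8 on its own.
let bytes = percentDecodedBytes("caf%E9")

// Interpreting the bytes as UTF-8 replaces the invalid byte with U+FFFD.
let asUTF8 = String(decoding: bytes, as: UTF8.self)
print(asUTF8 == "caf\u{FFFD}")  // true

// Interpreting each byte as a Latin-1 code point recovers the intended text.
let asLatin1 = String(bytes.map { Character(Unicode.Scalar($0)) })
print(asLatin1)  // café
```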

That's it?

I think that's it. Generally I think the API is pretty good, but I'd love to know what you think about these changes or any other aspect of the API.

There are a bunch of things that I'd like to add in the future (e.g. access to the host parser, setting the host property using a Host object, data URLs and MIME types, etc). There's some pretty exciting stuff happening in the web standards community, too - including a URLPattern standard for Express-like routing, all built on the WHATWG URL Standard which this project aligns with.

Currently the focus is on stabilising what is there. With your help and input, I think we can make this a really great library and elevate lots of parts of the Swift ecosystem. Thanks for reading!
