Some updates. I'm sure these are issues that people have encountered/struggled with in the past, so do feel free to leave feedback.
This has now been incorporated in the URL standard. cannot-be-a-base URLs are now referred to as having an "opaque path". This makes it easier to explain these URLs in documentation and why they have certain limitations.
For example, javascript:alert("hello") is a URL with an opaque path. The reason you cannot parse a relative reference (e.g. ../foo) against it is that its path cannot be split into a hierarchical list of components.
That's a lot more intuitive than saying "the parser decided this URL cannot be used as a base-URL", and it's better to find consensus and adjust the standard rather than inventing library-specific terminology.
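To make the "opaque path" idea a bit more concrete, here is a rough sketch in Swift. Note that hasOpaquePath and specialSchemes are hypothetical names for illustration and are not part of WebURL's API. The check is a simplification of what the URL Standard's parser does, and it assumes the input is a well-formed absolute URL string:

```swift
// The "special" schemes listed by the URL Standard. URLs with these schemes
// always have hierarchical (list) paths.
let specialSchemes: Set<String> = ["http", "https", "ws", "wss", "ftp", "file"]

/// Hypothetical helper: a rough check for whether a URL string would be parsed
/// with an opaque path. It does not validate the scheme or anything after it.
func hasOpaquePath(_ urlString: String) -> Bool {
    guard let colon = urlString.firstIndex(of: ":") else { return false }
    let scheme = urlString[..<colon].lowercased()
    if specialSchemes.contains(scheme) { return false }
    // For non-special schemes, the path is only hierarchical if the part
    // after the scheme starts with a slash.
    let rest = urlString[urlString.index(after: colon)...]
    return !rest.hasPrefix("/")
}

let isOpaqueScript = hasOpaquePath("javascript:alert(\"hello\")")   // true
let isOpaqueMail = hasOpaquePath("mailto:someone@example.com")      // true
let isOpaqueWeb = hasOpaquePath("https://example.com/foo/bar")      // false
let isOpaqueCustom = hasOpaquePath("foo:/bar/baz")                  // false
```

With no list of path components to walk, there is nothing for a relative reference like ../foo to pop or append to, which is why resolution fails for the first two.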
I'm leaning towards not doing this (i.e. keeping the existing automatic percent-decoding for path components), for a couple of reasons:
- Path components differ from URL-level components such as .path because they don't have URL-level structure which needs to be preserved. Automatically decoding the .path would be dangerous, because paths such as /foo/bar%2F..%2F would have their structure meaningfully altered by percent-decoding, but decoding the components after splitting the encoded path is safe. (FWIW, even URL itself gets it wrong, and sometimes forgets that its own path property is percent-decoded; this is where the biggest security concerns are. See [SR-15363] URL/NSURL.pathComponents splits the percent-decoded path · Issue #3195 · apple/swift-corelibs-foundation · GitHub)
- Having a blanket "no automatic decoding" rule across the entire API for the sake of consistency seems really attractive, but some APIs would become almost unusable as a result. This is especially true of form-encoded query parameters, which require a special form of percent-decoding that also turns "+" characters back into spaces:
var url = WebURL("https://example.com/currency/convert?amount=20&from=EUR&to=USD")!
url.formParams.to // "USD"
url.formParams.to = "US Dollars"
// url = "https://example.com/currency/convert?amount=20&from=EUR&to=US+Dollars"
url.formParams.to // Without decoding, this would return "US+Dollars"
url.formParams.to?.percentDecoded() // Incorrect. Still "US+Dollars"
url.formParams.to?.percentDecoded(substitutions: .formEncoding) // "US Dollars"
So the new rule would be: main components (hostname, path, query, etc.) are returned "raw", but their sub-components would be returned decoded. It's an attempt to be both accurate and convenient, but is it too complex? Percent-decoding an already-decoded path component or query parameter is still dangerous.
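To illustrate why the order of splitting and decoding matters, here is a small sketch that uses only Foundation's removingPercentEncoding rather than WebURL itself. The variable names are made up for the example, and splitting on "/" is a stand-in for a real path parser:

```swift
import Foundation

let encodedPath = "/foo/bar%2F..%2F"

// Safe: split the still-encoded path first, then decode each component.
// The encoded slashes stay inside a single component.
let splitThenDecode = encodedPath
    .split(separator: "/")
    .map { String($0).removingPercentEncoding ?? String($0) }
// ["foo", "bar/../"]

// Dangerous: decode the whole path first, then split.
// The encoded slashes become real separators and ".." becomes its own component.
let decodedPath = encodedPath.removingPercentEncoding ?? encodedPath
let decodeThenSplit = decodedPath
    .split(separator: "/")
    .map { String($0) }
// ["foo", "bar", ".."]

// Decoding an already-decoded value changes its meaning as well.
let rawComponent = "100%2525"
let decodedOnce = rawComponent.removingPercentEncoding   // "100%25"
let decodedTwice = decodedOnce?.removingPercentEncoding  // "100%"
```

The last three lines show the remaining hazard: if a component is already returned decoded, calling a decoding function on it again silently produces a different string.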
In my brief investigation of other URL libraries, only rust-url, .NET's Uri type, and Foundation provide path component APIs:
- rust-url's path_segments iterator keeps percent-encoding.
- .NET's Uri.Segments keeps percent-encoding.
- Foundation's URL.pathComponents returns percent-decoded path components as an Array.
On the other hand, when it comes to form parameters:
- rust-url's query_pairs returns form-decoded key-value pairs.
- Java's URLDecoder form-decodes a string such as a key or value (with optional text encoding parameter).
- Go's net/url package returns a view of form-decoded values from its .Query() function.
- Foundation's URLComponents.queryItems are percent-decoded, but not form-decoded.
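The difference between percent-decoding and form-decoding in the last item can be sketched with Foundation alone. formDecoded below is a hypothetical helper written for this example, not an API of Foundation or WebURL:

```swift
import Foundation

/// Hypothetical helper: form-decodes a single key or value.
/// The "+" substitution has to happen before percent-decoding; otherwise a
/// literal plus sign encoded as %2B would also be turned into a space.
func formDecoded(_ value: String) -> String? {
    value.replacingOccurrences(of: "+", with: " ").removingPercentEncoding
}

let components = URLComponents(
    string: "https://example.com/currency/convert?amount=20&from=EUR&to=US+Dollars"
)

// queryItems are percent-decoded, but the "+" is left alone.
let to = components?.queryItems?.first(where: { $0.name == "to" })?.value
// "US+Dollars"

let toDecoded = to.flatMap(formDecoded)   // "US Dollars"
let plusKept = formDecoded("1%2B1")       // "1+1"
```

This is the extra step every caller would have to remember if the form parameter API returned raw values.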
As for non-UTF8 contents, they are relatively rare and difficult to work with in Swift anyway. It is still possible to get a raw, percent-encoded path component/query parameter, decode it to bytes, and interpret those bytes with the text encoding of your choice (or as binary data), but that's always going to be a bit of an involved process. It's probably worth optimising for the common case.
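For what it's worth, the "involved process" for non-UTF8 contents might look something like the sketch below. percentDecodedBytes and hexValue are hypothetical helpers written for this example (they are not WebURL APIs), and Latin-1 is just an assumed encoding for the sample input:

```swift
import Foundation

/// Hypothetical helper: the numeric value of an ASCII hex digit, or nil.
func hexValue(_ byte: UInt8) -> UInt8? {
    switch byte {
    case 0x30...0x39: return byte - 0x30        // "0"..."9"
    case 0x41...0x46: return byte - 0x41 + 10   // "A"..."F"
    case 0x61...0x66: return byte - 0x61 + 10   // "a"..."f"
    default: return nil
    }
}

/// Hypothetical helper: percent-decodes a string to raw bytes, without
/// assuming that the result is valid UTF-8. Malformed escapes are kept as-is.
func percentDecodedBytes(_ string: String) -> [UInt8] {
    let input = Array(string.utf8)
    var output: [UInt8] = []
    var i = 0
    while i < input.count {
        if input[i] == 0x25 /* "%" */, i + 2 < input.count,
           let high = hexValue(input[i + 1]), let low = hexValue(input[i + 2]) {
            output.append((high << 4) | low)
            i += 3
        } else {
            output.append(input[i])
            i += 1
        }
    }
    return output
}

// A raw path component whose escaped byte is Latin-1, not UTF-8.
let rawComponent = "caf%E9"
let bytes = percentDecodedBytes(rawComponent)            // [0x63, 0x61, 0x66, 0xE9]
let text = String(bytes: bytes, encoding: .isoLatin1)    // "café"
```

It works, but it is clearly more ceremony than most callers should need for the common UTF-8 case.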