How to use WebURL with HTTPRequestHead.uri?

taylorswift · February 4, 2022, 10:15pm

i’m replacing some ad-hoc percent-encoding logic in one of my web services, and i’m looking at using WebURL to handle the URI routing. however, it looks like this is not a supported use case of WebURL.init(_:). is there any API dedicated to URI handling, without the scheme, host, or port?

Karl · February 5, 2022, 10:05am

Right, WebURL doesn't support purely-relative URLs.

Luckily, HTTP does not deal in purely-relative URLs; that's actually quite a common misconception. But I do have some advice for how to process these things. Apologies for delving into the standard, but I think it helps to show how the advice is derived.

What the (HTTP) standard says

When a client constructs an HTTP/1.1 request message, it sends the target URI in one of various forms, as defined in (Section 5.3 of RFC7230). When a request is received, the server reconstructs an effective request URI for the target resource (Section 5.5 of RFC7230).

RFC-7231, HTTP Semantics and Content

So somebody makes a request, sending something called a "request target", which is derived from the URL, and can be in various forms depending on the message (GET vs CONNECT, etc). The most common form is the "origin form", which consists of the URL's path and query:

GET /where?q=now HTTP/1.1
    ^^^^^^^^^^^^ - request target in origin form

And then the server, receiving this, should use it to reconstruct the effective request URL. This may require two additional pieces of information:

The scheme. If the connection is made with TLS, that will be https, otherwise http
The host. The standard recommends various ways of divining that - from using the Host: header field (which may not be trustworthy), to a configuration option, all the way down to heuristics and guesswork.

Many servers do not require knowledge of the intended host to route the URL, and I believe most server frameworks discourage routing based on the host (e.g. the host may be an IP address or "localhost" when testing, but not in production), so you could also use a placeholder.

Once you've decided which scheme and host you want to use:

The components of the effective request URI, once determined as above, can be combined into absolute-URI form by concatenating the scheme, "://", authority, and combined path and query component.

Example 1: the following message received over an insecure TCP connection
GET /pub/WWW/TheProject.html HTTP/1.1
Host: www.example.org:8080
has an effective request URI of

http://www.example.org:8080/pub/WWW/TheProject.html

So... that may seem a little bit hacky, but it is actually what the standard says. Personally, when I first read that, I was a little disappointed it wasn't more impressive or robust. It feels wrong; almost like you're writing a server in a bash script, but I digress.
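The concatenation rule quoted above can be sketched in plain Swift, without any URL library. The function name and its parameters are made up for illustration, and nothing here validates the Host value:

```swift
/// Builds the effective request URI for an origin-form request target by
/// concatenating scheme, "://", authority, and the combined path and query.
/// `isSecure`, `authority` and `requestTarget` are hypothetical parameter names.
func effectiveRequestURI(isSecure: Bool, authority: String, requestTarget: String) -> String {
  // The scheme follows from the transport: TLS means https, otherwise http.
  let scheme = isSecure ? "https" : "http"
  return scheme + "://" + authority + requestTarget
}

// Example 1 from the RFC excerpt: received over an insecure TCP connection.
let uri = effectiveRequestURI(
  isSecure: false,
  authority: "www.example.org:8080",
  requestTarget: "/pub/WWW/TheProject.html"
)
print(uri)  // http://www.example.org:8080/pub/WWW/TheProject.html
```

This only produces a string; it still has to be parsed and validated before it is used for routing.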

How to do it with WebURL

As ever with URLs, there is still plenty of room to make mistakes. One thing I sometimes see in server frameworks is that they will treat the request target as a relative reference and try to resolve it against a base URL.

But there are lots of kinds of relative reference, so processing it in this way can lead to incorrect results. One thing that can go wrong is that if the path begins with 2 slashes, it will be interpreted as a scheme-relative URL and can change the host:

base URL:     "http://example.com"
relative ref: "//abc/def?zz"

result:       "http://abc/def?zz" (host = "abc", path = "/def", query = "zz")
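The same effect can be reproduced with a general-purpose resolver. This is a small sketch using Foundation's URL rather than WebURL (chosen only for illustration; it is not a claim about which frameworks resolve request targets this way):

```swift
import Foundation

// Resolving a request target as a relative reference: the approach to avoid.
let base = URL(string: "http://example.com")!
let requestTarget = "//abc/def?zz"

// The leading "//" is read as the start of an authority, so the host
// comes from the request target instead of from the base URL.
let resolved = URL(string: requestTarget, relativeTo: base)!.absoluteURL

print(resolved.host ?? "(no host)")
print(resolved.path)
print(resolved.query ?? "(no query)")
```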

Instead, since we know the request target consists of a path and query, we should split it and set the components individually. For WebURL, that would look something like this (taking the request target as a buffer of bytes; you could also use a String):

import WebURL

func getEffectiveRequestURL(
  https: Bool = false,
  hostname: String,
  requestTarget: [UInt8]
) -> WebURL {

  var effectiveURL = https ? WebURL("https://x/")! : WebURL("http://x/")!
  try! effectiveURL.setHostname(hostname)

  // The path is everything up to the first '?'
  // It should begin with a "/" (not checked here), but the result is the same either way.
  let queryDelimiter = requestTarget.firstIndex(of: UInt8(ascii: "?"))
  try! effectiveURL.utf8.setPath(requestTarget[..<(queryDelimiter ?? requestTarget.endIndex)])

  // And the query is everything after it.
  if let queryDelimiter = queryDelimiter {
    let queryStart = requestTarget.index(after: queryDelimiter)
    try! effectiveURL.utf8.setQuery(requestTarget[queryStart...])
  }
  return effectiveURL
}

Doing this, we get the correct result:

getEffectiveRequestURL(hostname: "example.com", requestTarget: Array("//abc/def?zz".utf8))

// "http://example.com//abc/def?zz"
// - host:  "example.com"
// - path:  "//abc/def"
// - query: "zz"

And then you can process the .pathComponents or .formParams to do your routing. Hopefully in the future we'll have some kind of pattern/regex support for doing that.
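As a rough sketch of what routing on path components could look like, here is a tiny matcher over an array of strings, such as the strings you get by iterating .pathComponents. The `Route` type and `route(for:)` function are hypothetical names, not part of WebURL:

```swift
// Hypothetical route table for illustration only.
enum Route: Equatable {
  case home
  case user(name: String)
  case notFound
}

/// Maps a list of path components to a route.
func route(for components: [String]) -> Route {
  // Drop empty components so that repeated or trailing slashes do not matter.
  let parts = components.filter { !$0.isEmpty }
  switch parts.count {
  case 0:
    return .home
  case 2 where parts[0] == "users":
    return .user(name: parts[1])
  default:
    return .notFound
  }
}

print(route(for: ["users", "karl", ""]))  // same route with or without the trailing empty component
```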

Hope it helps!

taylorswift · February 5, 2022, 5:56pm

everything after the host:port is opaque to the network, so its interpretation is entirely up to the server implementing the service. so i don’t see why the

is helpful here. it’s entirely possible to use the URIs (including the query) as dictionary keys, more-sophisticated path and query handling is entirely a UX problem. so what i was looking for was something that would convert the URIs to a representation that can be used to implement a navigation system that matches what a typical user expects. as far as i understand, the parts of the URL before the URI are only needed to deliver the request to the right server, they should not be involved in internal routing unless the server has more than one domain name mapped to it.

also, the HTTP layer lives on top of the TLS layer, so the HTTP handler shouldn’t really care or know about how the message was secured.

Karl · February 5, 2022, 6:43pm

The scheme can be important, because it determines how the path is interpreted. For example, in the WHATWG model, backslashes in the path are normalised to forward-slashes, but only for URLs with "special" schemes (i.e. schemes whose semantics are known to the standard; http(s), ws(s), ftp, and file).

Of course, that behaviour is not specified for HTTP request targets, since they are technically neither URLs nor relative references. However, if you were to do something different and give unescaped backslashes special meaning in the request target, no browser or other conforming actor on the web could form a request to it (via a URL). So it seems reasonable to match how the path would be interpreted if it were to be found in a http(s) URL.
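To make the backslash rule concrete, here is a deliberately simplified sketch. A real parser applies this while parsing rather than as a separate pass, and `normalizeBackslashes` is a made-up helper, not a WebURL API:

```swift
// The "special" schemes listed above.
let specialSchemes: Set<String> = ["http", "https", "ws", "wss", "ftp", "file"]

/// Replaces "\" with "/" in a path, but only when the scheme is special.
func normalizeBackslashes(inPath path: String, scheme: String) -> String {
  guard specialSchemes.contains(scheme.lowercased()) else { return path }
  return String(path.map { (c: Character) -> Character in c == "\\" ? "/" : c })
}

print(normalizeBackslashes(inPath: "/a\\b", scheme: "http"))  // /a/b
print(normalizeBackslashes(inPath: "/a\\b", scheme: "foo"))   // /a\b
```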

When it comes to the host - as you say, it can be important if the server has more than one domain mapped to it. If that's not the case, a placeholder is fine -- but you will need a placeholder, because http(s) URLs with no hostname or an empty hostname are considered invalid by construction. The standard forbids them from ever existing.

While it is true that layers higher in the stack should not need to care about things lower down, they often want to care.

To give an analogy - when I'm streaming a video to my phone, the application does not necessarily need to worry whether I'm on a cellular connection or WiFi. It should technically work either way. But that information can still be useful; it might decide to stream a lower-bitrate copy since the transport is assumed to have less capacity and/or be less reliable, or it might delay some non-critical download that would drain my battery and eat too much of my data allowance.

Similarly, you may decide not to serve certain content unless you know the transport is secure. If that's not a factor, you can ignore the difference between http and https when building the "effective request URL".

taylorswift · February 6, 2022, 1:38am

i’m not building a full-on file server, so the number of possible URIs i need to route is quite small. what’s more important for my use case is that the URI is tolerant of minor spelling variations, so you can access the same resource with or without trailing slash, percent-encoding, etc.

other people might be doing something different, but i’m discriminating between http and https based on port number. so i have a really simple endpoint listening on port 80, that just redirects all traffic to the real endpoint on port 443, which then normalizes the URI and if needed, serves yet another 301 redirect.

so if you visit example.org:80/foo//bar/ it would first redirect you to example.org:443/foo//bar/ and my ad-hoc URI logic normalizes /foo//bar/ to /foo/bar, and responds with the resource mapped to "/foo/bar".
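A rough sketch of that kind of normalization follows. It is not the actual code from the service; `normalizePath` is a made-up name, and it assumes percent-decoding each segment is acceptable for the routes in question:

```swift
import Foundation

/// Percent-decodes each segment, drops empty segments (repeated or trailing
/// slashes), and rejoins the rest with a leading "/".
/// Caveat: an encoded slash (%2F) decodes to "/" and becomes indistinguishable
/// from a real separator after joining.
func normalizePath(_ path: String) -> String {
  let segments: [String] = path
    .split(separator: "/", omittingEmptySubsequences: true)
    .map { segment in
      let raw = String(segment)
      // Fall back to the raw text if the percent-encoding is malformed.
      return raw.removingPercentEncoding ?? raw
    }
  return "/" + segments.joined(separator: "/")
}

print(normalizePath("/foo//bar/"))  // /foo/bar
```

If the normalized path differs from the one requested, the handler can respond with the 301 redirect described above.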