Could WebURL.PathComponents be standalone?

WebURL.PathComponents would be a really nice abstraction for swiftinit.org to use for its URL routing: the site serves over 100,000 unique endpoints, and depends on being able to store each percent-encoded URI string in a single buffer allocation in order to keep memory usage down.

however, it doesn’t seem to be possible to use this abstraction to model a URI path, or subpath, by itself, without being attached to a larger WebURL instance. this is unfortunate, because a swiftinit url looks like:

https://swiftinit.org/reference/swift-foo/0.2.3/foomodule/foo.bar%28_:_:%29
|<---------entirely irrelevant -------->|

but we really only want to be storing segments that look like

"/0.2.3/foomodule/foo.bar%28_:_:%29"
"/foomodule/foo.bar%28_:_:%29"
"/foomodule/foo"
"bar%28_:_:%29"

could WebURL.PathComponents evolve into something that can be used on its own?

in a similar vein, could we also get a Component type (analogous to FilePath.Component) that models a single percent-encoded URL component, and never includes an unescaped / character?


So, firstly - if you want to read raw path components (without automatically unescaping them), there is a raw: subscript, and the UTF8 view includes a pathComponent function, which tells you exactly where a path component is in the overall string.
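
For example, here is a minimal sketch of reading components both ways, using the example URL from above (exact signatures may vary slightly between WebURL versions):

import WebURL

// decoded vs. raw (still percent-encoded) path components
let url = WebURL("https://swiftinit.org/reference/swift-foo/0.2.3/foomodule/foo.bar%28_:_:%29")!
for index in url.pathComponents.indices
{
    print(url.pathComponents[index])      // percent-decoded, e.g. "foo.bar(_:_:)"
    print(url.pathComponents[raw: index]) // as written,      e.g. "foo.bar%28_:_:%29"
}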

We try not to make any assumptions about what path components mean. Percent-encoded UTF-8 text is about as far as we can reasonably go, and there are plenty of situations where even that is too far, which is why those raw APIs exist (e.g. file URL paths might decode to arbitrary bytes rather than UTF-8).

Perhaps. In general I'm happy to expose lower-level APIs to process data as URLs do (even if not stored in a WebURL object). Depending on how low-level they are, they may come with weaker stability guarantees.

Reading

The PathComponents view relies heavily on its path string being normalised. So if you're starting with, say, the path string in a GET request, that means handling all of the weird compatibility quirks - for instance, whether or not backslashes are interpreted as path separators depends on the URL's scheme:

"/foo\bar/baz" - what does it mean?

HTTP:  ["foo", "bar", "baz"] -> "/foo/bar/baz"
OTHER: ["foo\bar", "baz"]    -> "/foo\bar/baz"
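
A small sketch of the same path under both interpretations (the output comments show the results the standard prescribes):

import WebURL

// the same raw path, parsed under a special (http) and a non-special (git) scheme
let special = WebURL(#"http://example.com/foo\bar/baz"#)!
print(Array(special.pathComponents)) // expected: ["foo", "bar", "baz"]

let other = WebURL(#"git://example.com/foo\bar/baz"#)!
print(Array(other.pathComponents))   // expected: ["foo\bar", "baz"]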

Windows drive letters also mess up how relative references are resolved in file URLs, etc. There's a lot of weird stuff.

In WebURL, the _PathParser sorts all of that mess out. It has quite a unique implementation; most others (e.g. WebKit, Rust) will allocate a vector to keep track of the path as it is being parsed, but we do it without any heap allocations at all. Doing it this way involved building up a lot of test infrastructure, and exposed a fair number of coverage gaps and bugs in the URL Standard (fixed now; that's the benefit of a living standard), so I'd be happy if people got more use out of it! Using the path parser is relatively straightforward and flexible.

You can try using the SPI, which returns the simplified path string as though it were being set on a URL. After that, once the path is normalised/simplified, reading path components is just splitting on ASCII forward-slashes and percent-decoding as necessary. The parsing/normalisation is the most difficult thing for reading.
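
For instance, a trivial sketch of that second step, assuming the path string has already been normalised:

// once normalised, component boundaries are just ASCII forward-slashes
let simplified = "/foomodule/foo.bar%28_:_:%29"
let rawComponents = simplified.split(separator: "/").map { String($0) }
// ["foomodule", "foo.bar%28_:_:%29"] - percent-decode each one as your application requires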

Writing

Modifying paths is a whole other bottle of trouble. It's difficult to know which operations are allowed on URL paths, and sometimes it can depend on facts about the other URL components (really).

Exposing that logic would probably be quite difficult. For example, whether or not you can set a URL's path to the empty string (or an empty collection) depends on its scheme, and details like whether or not it has a hostname. Sometimes, depending on its contents, you need to escape the path itself within the URL string.

But yeah, it suggests that perhaps it does make sense for WebURL to offer a freestanding URL path type some day. For now, I don't think it's necessary for v1.0, and the pieces (at least for reading) are semi-exposed if you want to DIY.


for my use case (a server implementation), the scheme is always HTTP (+TLS). backslashes are invalid, and for swiftinit.org specifically, never actually occur since \ is not a valid swift operator character.

for my use case (a server implementation), i have no intention of mapping URLs to a file system, in fact this is something i actively avoid doing, since it presents a security hazard.

. and .. are not something i intend to support, since they provide little utility to visitors, and it’s easy to imagine how they could be abused by an attacker down the road.

i think our difference in paradigm stems from the fact that WebURL treats a URL as an entire interdependent object that must always be kept internally-consistent, which is useful for ensuring that you never generate an invalid URL. however, it’s just not very useful for server-side use cases where the layout of the site never strays near the dangerous edge cases.

what swiftinit basically has to do is (spitballing here):

let uri:URI = "/reference/swift-foo/0.2.3/foomodule/foo.bar%28_:_:%29"
// /swift-foo/0.2.3/foomodule/foo.bar%28_:_:%29
let package:URI.SubSequence = uri.dropFirst()
// /0.2.3/foomodule/foo.bar%28_:_:%29
let version:URI.SubSequence = package.dropFirst()
// /foomodule/foo.bar%28_:_:%29
let module:URI.SubSequence = version.dropFirst()
// /foo.bar%28_:_:%29
var path:URI = .init(module.dropFirst())

path.insert("FooModule".lowercased(), at: path.startIndex)
if versioned 
{
    path.insert("\(major).\(minor).\(patch)", at: path.startIndex)
}
if !whitelisted 
{
    path.insert("swift-foo".lowercased(), at: path.startIndex)
}
path.insert("reference", at: path.startIndex)

if path == uri 
{
    self.respond(...)
}
else 
{
    self.redirect(...)
}

it would be nice if WebURL had an API for this.

Right, but HTTP recommends that you use the "request target" to reconstruct the URL used to make the request. In the web's URL model, the following paths are identical:

  • /foo/bar
  • \foo\bar
  • /foo/tmp/../bar

That last one may look a bit funny, but some clients (such as Foundation) can actually send requests to your server which look like:

GET /foo/tmp/../bar HTTP/1.1

At which point, you have to decide what to do. You can use WebURL's parser to interpret them as part of a web-compatible HTTP URL, or you can do something different - either show a different resource for each one, or just reject the request entirely.

If you show a different resource for each, those resources will not be visible to the web. It is impossible for HTTP URLs on the web to even express something like "/foo/tmp/../bar" as a distinct location, so modern browsers wouldn't be able to view that page. Your resource may be visible to some clients (like Foundation) which will send these requests, but not to most other clients, at least not via a URL.

Or you can reject the request (as you say you're doing). But then again, not all clients abide by the standard, so some actually do send these kinds of requests for perfectly innocent reasons. I managed to get Foundation to send that one using an HTTP redirect, but there are probably other clients who will send them for other mundane tasks - perhaps when resolving relative links on HTML pages, they just leave them as "/foo/tmp/../bar" and let the server deal with them.

Despite those clients, best practice on the web is to be lenient in what you accept, and strict about what you transmit. If you just reject things you don't want to deal with, like ".." components, backslashes, etc., you are instead being strict in what you accept, and your server may not work with all clients as you expect.

. and .. are not something i intend to support

I hope this explains a bit about why all of that weird web compatibility behaviour is important. To be honest, I'm not sure I agree that you even have a choice about supporting it; if you're going to be serving sites on the web, I think you are basically obligated to support "." and ".." components, as well as the rest of the web's compatibility requirements. It's quite important that we all agree about which URLs are structurally identical, both for the clients making requests and for the servers accepting them.

This is the baseline level of identity - servers can choose to be more relaxed than this (treating more URLs as identical), but they cannot be stricter (treating more URLs as distinct) while being web compatible.


okay so, these are all great points. after reading this, i think swiftinit should support vertical navigation with .. and .

i think . could be added without many problems; .. is more problematic since .. is a valid swift operator lexeme. but we could probably work around that for now by always requiring that to be spelled ..%28_:_:%29.
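
(for what it’s worth, only literal "." and ".." segments - or their %2E-encoded spellings - are treated as dot segments, so a component like ..%28_:_:%29 should survive normalisation intact:)

import WebURL

// ..%28_:_:%29 is an ordinary component, not a dot segment
let url = WebURL("https://swiftinit.org/foomodule/..%28_:_:%29")!
print(Array(url.pathComponents)) // ["foomodule", "..(_:_:)"]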

however, i just don’t know about the argument for keeping the scheme and hostname around. as i mentioned, storing an entire URL from start to finish, like

https://swiftinit.org/reference/swift-foo/0.2.3/foomodule/foo.bar%28_:_:%29
https://swiftinit.org/reference/swift-foo/0.2.3/foomodule/foo/baz.bar%28_:_:%29

just isn’t practical from a memory-usage standpoint. what swiftinit does internally is just store the parts after foomodule, in decomposed stem × leaf form. so for example, its internal routing table looks (conceptually) kind of like:

typealias StemID = UInt
typealias LeafID = UInt
let stems:[String: StemID] = 
[
    "/foo": 0,
    "/foo/baz": 1,
]
let leaves:[String: LeafID] = 
[
    "bar%28_:_:%29": 0,
]
// pretending tuples are Hashable
let table:[(StemID, LeafID): Symbol] = 
[
    (0, 0): ..., // page for "foo.bar%28_:_:%29"
    (1, 0): ..., // page for "foo/baz.bar%28_:_:%29"
]
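
(a compilable version of the same table, using a small Hashable key struct in place of the tuple - Symbol is just a stand-in here:)

struct RouteKey:Hashable
{
    let stem:StemID
    let leaf:LeafID
}
struct Symbol {}

let table:[RouteKey: Symbol] =
[
    .init(stem: 0, leaf: 0): .init(), // page for "foo.bar%28_:_:%29"
    .init(stem: 1, leaf: 0): .init(), // page for "foo/baz.bar%28_:_:%29"
]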

how would you recommend approaching this?


somewhat related:

it sounds like WebURL treats

https://swiftinit.org/reference/swift-foo/foomodule/foo.bar%28_:_:%29

and

https://swiftinit.org/reference/swift-foo/0.2.3/../foomodule/foo.bar%28_:_:%29

as equivalent under ==(_:_:). but this means we can’t compare the request URI and the canonical URI in order to issue permanent redirects.

on the other hand, comparing the raw percent-encoded HTTPRequestHead.uri with the canonical URI rendered as a percent-encoded String isn’t right either, since that would pick up meaningless differences in percent encoding and hex digit capitalization, which could send a client into a redirect loop.

does WebURL provide API for determining if a redirect should be issued?


Yeah, that's something that WebURL doesn't currently support. To state the obvious, when you're building custom routing tree data structures, you're doing custom routing, and will probably want a lot of control over the implementation. It needs quite a lot of thought, so it's definitely not on the agenda before v1.0, if it's even the kind of thing WebURL should offer at all.

That said, the thing that we definitely, 100% can help you with is path parsing/normalisation - in other words, how to process the path as though it were part of a URL. We can help you understand its structure, and then you use your custom processing to decide what it means to your application.

For now, I mentioned an SPI that could help you: _simplifyPath. It doesn't add percent-encoding (which is fine since I guess your tree will remove it anyway), but otherwise it will simplify the path string for you.

The web treats them as identical. I mentioned this in the SSWG proposal:

Foundation.URL generally tries to keep URL strings as you provide them. Its parser is strict about minor syntax mistakes, and components are generally treated as opaque strings (except for percent-encoding) and not automatically normalized.

The new URL standard, on the other hand, essentially requires a different model. For one thing, it defines two operations for URL strings: parsing a string in to a URL record, and serializing a URL record as a string. WebURL doesn't just scan URL strings - it also interprets them, breaking them down in to URL records and rewriting them. It has a more complete understanding of what a URL means, which allows it to offer richer APIs with stricter guarantees, such as the guarantee that WebURL values are always normalized.

As part of parsing the latter string, the /0.2.3/.. portion is interpreted and removed. Like it never existed. And since those 2 URLs then have the same string, they are identical.

It's kind of like String inserting Unicode replacement characters for invalid bytes - you can't ask which invalid byte caused a specific replacement character later. That information is just gone.

Personally, I wouldn't try to detect these URLs and offer redirects. I'd just let path normalisation... well, normalise them! And after that, you can just handle them like a cleanly-written path without any ".." components or other weird stuff. I think that's the ideal case - so you can use WebURL to interpret a URL's path in a web-compatible way (structurally), then hand the pieces over to your application logic to process further.
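
For instance, a rough sketch using the public resolve(_:) API against a dummy base URL (the base is arbitrary; the output comments show the expected results):

import WebURL

// interpret an incoming request target the way the web would, then route on the result
let base = WebURL("https://swiftinit.org/")!
let target = "/reference/swift-foo/0.2.3/../foomodule/foo.bar%28_:_:%29"
if let resolved = base.resolve(target)
{
    print(resolved.path)                  // "/reference/swift-foo/foomodule/foo.bar%28_:_:%29"
    print(Array(resolved.pathComponents)) // ["reference", "swift-foo", "foomodule", "foo.bar(_:_:)"]
}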

they need to be normalized with a redirect, because users may copy-and-paste the URLs from the browser navigation bar, and duplicate URLs for the same page are bad for SEO.

for example, discourse redirects

https://forums.swift.org/t/could-weburl-pathcomponents-be-standalone////56478

to

https://forums.swift.org/t/could-weburl-pathcomponents-be-standalone/56478

it looks like _simplifyPath(_:) is an instance method on an SPI view of WebURL. if i’m not mistaken, constructing one of those requires knowledge of the URL host. is there a static func version of that API that assumes the scheme is HTTP(S)?

I know there are web APIs which allow changing the URL without reloading the page (such as window.history.pushState). That's how many modern websites offer single-page experiences where each page can retain a unique URL.

I'm not able to advise whether what you're looking for is best done on the client or server side, but for the situation you describe, handling structural path normalisation by issuing redirects, you can check whether the output of WebURL's _simplifyPath changes the string. Things like percent-encoding and empty components are application-level, content-based normalisation rather than structural normalisation. It's not really WebURL's place to make those decisions, although it can provide APIs (like _simplifyPath) which at least help you to interpret the structure as a URL parser would.

For routing through percent-encoding, WebURL provides lazy percent-decoding, which also seems valuable since you can take advantage of early-exits. You can even track whether each byte is being returned verbatim from the source, or being percent-decoded, if you want to be aware of that.

For empty components, WebURL could offer lower-level access to its path parser - so you could simply decide not to emit them, or to match the path components directly instead of rewriting the path in a buffer. Feel free to hack around with any of this - the source is available, I hope it's easy to follow, and I'm happy to answer any questions if anything is unclear.

Indeed, it simplifies the path as though it were about to be set on the given URL value. It's like using the .path setter, except that instead of writing the normalised path in-place, it writes it to a different buffer (and doesn't add percent-encoding, etc). So I would recommend using a dummy HTTP/S URL (e.g. your server's root URL) as the context.

That is, unfortunately, as good as it gets right now - there isn't a static version, because this is just something I happen to use in some tests and fuzzers. Its existence is an accident and its semantics are not stable, which is why it's hidden away.

BUT for the questions you're asking, about possible future supported WebURL APIs, it's useful for 3 reasons:

  1. Right now it does something which could be useful for your web server. You said you didn't want to pay the allocation cost for the URL's scheme and hostname every time, which is reasonable - and with this, you won't. So do you notice a significant difference? Is it quantifiable? I'd love to know.

  2. You can use it to prototype the consumer end of things - like, what if WebURL had a real, supported API for just normalising a path string? How could you use it, and what else would you ask for?

  3. It is a simple demonstration of how to interact with WebURL's internal _PathParser API. The way to get the best API for your use-case is ultimately to prototype it yourself. Of course, I'm happy to answer any questions about WebURL's source and how to expose any of its logic.
