So, firstly - if you want to read raw path components (without automatic unescaping), there is a raw: subscript, and the UTF8 view includes a pathComponent function, which tells you exactly where a path component is in the overall string.
We try not to make any assumptions about what path components mean. Percent-encoded UTF-8 text is about as far as we can reasonably go, and there are plenty of situations where even that is too far, which is why those raw APIs exist (e.g. file URL paths might decode to arbitrary bytes rather than UTF-8).
Perhaps. In general I'm happy to expose lower-level APIs to process data as URLs do (even if not stored in a WebURL object). Depending on how low-level they are, they may come with weaker stability guarantees.
Reading
The PathComponents view relies heavily on its path string being normalised. So if you're starting with, say, the path string in a GET request, that means handling all of the weird compatibility quirks - for instance, whether or not backslashes are interpreted as path separators depends on the URL's scheme:
"/foo\bar/baz" - what does it mean?
HTTP: ["foo", "bar", "baz"] -> "/foo/bar/baz"
OTHER: ["foo\bar", "baz"] -> "/foo\bar/baz"
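To make that concrete, here is a rough sketch of the scheme-dependent splitting. It is not WebURL's parser: naiveSplitPath and specialSchemes are names made up for this example, and it ignores everything else normalisation does (dot segments, percent-encoding, drive letters). The set of schemes is the "special" list from the URL Standard.

```swift
// Schemes the URL Standard calls "special". Only these treat "\" as a path separator.
let specialSchemes: Set<String> = ["http", "https", "ws", "wss", "ftp", "file"]

/// Hypothetical helper (not WebURL API): splits a path string into components,
/// treating backslashes as separators only for special schemes.
func naiveSplitPath(_ path: String, scheme: String) -> [String] {
  let isSpecial = specialSchemes.contains(scheme.lowercased())
  // Split on bytes, so we only ever cut at ASCII separators.
  var pieces = path.utf8.split(omittingEmptySubsequences: false) { byte in
    byte == UInt8(ascii: "/") || (isSpecial && byte == UInt8(ascii: "\\"))
  }
  // Drop the empty piece which precedes the leading separator, if there is one.
  if let first = pieces.first, first.isEmpty {
    pieces.removeFirst()
  }
  return pieces.map { String(decoding: $0, as: UTF8.self) }
}

print(naiveSplitPath("/foo\\bar/baz", scheme: "http"))
print(naiveSplitPath("/foo\\bar/baz", scheme: "other"))
```

The first call splits at the backslash and the second does not, so the same path string yields a different number of components.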
Windows drive letters also mess up how relative references are resolved in file URLs, etc. There's a lot of weird stuff.
In WebURL, the _PathParser sorts all of that mess out. It has quite a unique implementation; most others (e.g. WebKit, Rust) will allocate a vector to keep track of the path as it is being parsed, but we do it without any heap allocations at all. Doing it this way involved building up a lot of test infrastructure, and exposed a fair number of coverage gaps and bugs in the URL Standard (fixed now; that's the benefit of a living standard), so I'd be happy if people got more use out of it! Using the path parser is relatively straightforward and flexible.
You can try using the SPI, which returns the simplified path string as though it were being set on a URL. After that, once the path is normalised/simplified, reading path components is just splitting on ASCII forward-slashes and percent-decoding as necessary. The parsing/normalisation is the most difficult thing for reading.
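As a minimal sketch of that last step, assuming the path string has already been normalised: split on "/" and percent-decode each piece to bytes. The function names here (percentDecode, rawComponents) are hypothetical, not WebURL API. The result is kept as bytes, since a component need not decode to valid UTF-8.

```swift
/// Hypothetical helper: percent-decodes a sequence of bytes.
/// Malformed or incomplete escapes (e.g. "%G1", a trailing "%4") are copied through unchanged.
func percentDecode<C: Collection>(_ bytes: C) -> [UInt8] where C.Element == UInt8 {
  func hexValue(_ b: UInt8) -> UInt8? {
    switch b {
    case UInt8(ascii: "0")...UInt8(ascii: "9"): return b - UInt8(ascii: "0")
    case UInt8(ascii: "a")...UInt8(ascii: "f"): return b - UInt8(ascii: "a") + 10
    case UInt8(ascii: "A")...UInt8(ascii: "F"): return b - UInt8(ascii: "A") + 10
    default: return nil
    }
  }
  let input = Array(bytes)
  var output: [UInt8] = []
  output.reserveCapacity(input.count)
  var i = 0
  while i < input.count {
    if input[i] == UInt8(ascii: "%"), i + 2 < input.count,
       let hi = hexValue(input[i + 1]), let lo = hexValue(input[i + 2]) {
      output.append(hi << 4 | lo)
      i += 3
    } else {
      output.append(input[i])
      i += 1
    }
  }
  return output
}

/// Hypothetical helper: the decoded bytes of each component of an
/// already-normalised path string.
func rawComponents(ofNormalizedPath path: String) -> [[UInt8]] {
  var pieces = path.utf8.split(separator: UInt8(ascii: "/"), omittingEmptySubsequences: false)
  // Drop the empty piece which precedes the leading "/".
  if let first = pieces.first, first.isEmpty {
    pieces.removeFirst()
  }
  return pieces.map { percentDecode($0) }
}

let components = rawComponents(ofNormalizedPath: "/foo/b%20r/%FF")
// Decoding as UTF-8 is optional, and lossy for bytes like 0xFF.
print(components.map { String(decoding: $0, as: UTF8.self) })
```

Splitting before decoding matters: an escaped slash (%2F) stays inside its component rather than becoming a separator.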
Writing
Modifying paths is a whole other bottle of trouble. It's difficult to know which operations are allowed on URL paths, and sometimes it can depend on facts about the other URL components (really).
Exposing that logic would probably be quite difficult. For example, whether or not you can set a URL's path to the empty string (or an empty collection) depends on its scheme, and details like whether or not it has a hostname. Sometimes, depending on its contents, you need to escape the path itself within the URL string.
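One instance of that last point, as I understand the URL Standard's serialiser: if a URL has no host and its path starts with an empty component, the path would begin with "//" and be mistaken for a hostname when re-parsed, so the serialiser writes "/." before it. The sketch below shows only that rule; the serialize function and its parameters are invented for this example, and it ignores everything else (queries, fragments, opaque paths, and so on).

```swift
/// Hypothetical, heavily simplified serialiser: scheme, optional host,
/// and a list of path components. Not WebURL API.
func serialize(scheme: String, host: String?, path: [String]) -> String {
  var out = scheme + ":"
  if let host = host {
    out += "//" + host
  } else if path.count > 1, path[0].isEmpty {
    // Without a host, a path beginning "//" would be re-parsed as a hostname.
    // The "/." prefix keeps it a path.
    out += "/."
  }
  for component in path {
    out += "/" + component
  }
  return out
}

print(serialize(scheme: "foo", host: nil, path: ["", "bar"]))
print(serialize(scheme: "foo", host: "example.com", path: ["", "bar"]))
```

So the same list of path components produces a different path string depending on whether the URL has a host, which is the kind of cross-component dependency a freestanding path type would have to deal with.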
But yeah, it suggests that perhaps it does make sense for WebURL to offer a freestanding URL path type some day. For now, I don't think it's necessary for v1.0, and the pieces (at least for reading) are semi-exposed if you want to DIY.