I'm currently working on a pure-Swift URL parser based on the WHATWG spec, which should handle this and a bunch of other Foundation URL oddities and inefficiencies. It's too early to show yet, and I haven't attempted any optimisation of the prescribed parsing algorithm. I may have something to show in 7-10 days, depending on how much time I get to work on it. The standard library is missing some features that would really help, like being able to mutate a String's internal code units (and optionally to bypass String's internal UTF-8 validation while doing so, since the parser already handles non-ASCII codepoints), but discovering those limitations is part of why I'm doing this.
Ultimately my goal is to hook it up to NIO to provide a truly open, cross-platform, non-blocking filesystem and network interface.
Sorry if it's not too helpful, it was just really interesting to me that you mentioned this while I'm working on the same thing.
I want to echo this, and express my desire to see URL natively handle internationalised URIs.
I discovered this problem yesterday, and considered writing an evolution pitch about it, but then remembered that the evolution process doesn't really cover Foundation APIs.
What makes this problem so difficult for me is that URL already conforms to Codable. I can percent-encode the strings myself before passing them to URL.init(string:). However, to make the percent-encoding step work with Codable, I must either create a custom type wrapping Foundation's URL, or write custom boilerplate for init(from decoder:) and encode(to encoder:). If I choose the custom-type route, I'll need to recreate or redirect every URL property I use, and still convert the custom type back to URL before passing it to any function that takes URL parameters. If I choose the custom Codable conformance route, I'll lose the synthesised conformance not just for the URL property, but for every other property of the containing type, since writing init(from decoder:) by hand means decoding all of them by hand. Either way, the boilerplate quickly grows repetitive.
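To illustrate the wrapper route, here is a minimal sketch (LenientURL is a hypothetical name; using .urlQueryAllowed is a hack that works only because that set happens to leave ":" and "/" intact, so just the non-ASCII characters get escaped):

```swift
import Foundation

// Hypothetical wrapper type illustrating the boilerplate described above.
// .urlQueryAllowed leaves ":" and "/" alone, so only non-ASCII characters
// (and a few others, like spaces) get percent-encoded.
struct LenientURL: Codable {
    let url: URL

    init(from decoder: Decoder) throws {
        let container = try decoder.singleValueContainer()
        let raw = try container.decode(String.self)
        let encoded = raw.addingPercentEncoding(
            withAllowedCharacters: .urlQueryAllowed) ?? raw
        guard let url = URL(string: encoded) else {
            throw DecodingError.dataCorruptedError(
                in: container,
                debugDescription: "Invalid URL string: \(raw)")
        }
        self.url = url
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.singleValueContainer()
        try container.encode(url.absoluteString)
    }
}
```

And then every API that takes a URL still needs the extra `.url` hop, which is exactly the friction described above.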
It's a lot of work to use it properly. For example, if you have a URL like http://example.com/引き割り.html, this is what happens if you encode the entire thing:
let internationalisedURLString = "http://example.com/引き割り.html"
let percentEncodedURLString = internationalisedURLString
    .addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)!
// percentEncodedURLString: String =
// "http%3A//example.com/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html"
let percentEncodedURL = URL(string: percentEncodedURLString)
print(percentEncodedURL!.path)
// http://example.com/引き割り.html
print(percentEncodedURL!.absoluteURL)
// http%3A//example.com/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html
Notice how the entire URL string is encoded as if it contained only the path component? Since addingPercentEncoding(withAllowedCharacters:) is an NSString/StringProtocol method, it's not really its job to recognise the string's shape and give URL strings special treatment.
One way to handle this is to parse the URL structure yourself, then encode each component with the corresponding character set: .urlHostAllowed, .urlPathAllowed, .urlQueryAllowed, etc.
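A minimal sketch of that approach, assuming a well-formed absolute URL and ignoring host (IDN), query, and fragment encoding for brevity:

```swift
import Foundation

// Minimal sketch: split off the path manually, then encode just that
// component. A real implementation would also handle the host, query,
// and fragment with their respective character sets.
let raw = "http://example.com/引き割り.html"
if let schemeEnd = raw.range(of: "://"),
   let pathStart = raw[schemeEnd.upperBound...].firstIndex(of: "/") {
    let prefix = String(raw[..<pathStart])   // "http://example.com"
    let path = String(raw[pathStart...])     // "/引き割り.html"
    let encodedPath = path.addingPercentEncoding(
        withAllowedCharacters: .urlPathAllowed)!
    print(prefix + encodedPath)
    // http://example.com/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html
}
```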
Another way is looping through the string and encoding each character individually:
let internationalisedURLString = "http://example.com/引き割り.html"
var percentEncodedURLString = ""
internationalisedURLString.forEach { character in
    if character.isASCII {
        percentEncodedURLString.append(character)
    } else {
        let percentEncodedCharacter = String(character)
            .addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)!
        percentEncodedURLString.append(percentEncodedCharacter)
    }
}
// percentEncodedURLString: String =
// "http://example.com/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html"
let percentEncodedURL = URL(string: percentEncodedURLString)
print(percentEncodedURL!.path)
// /引き割り.html
print(percentEncodedURL!.absoluteURL)
// http://example.com/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html
The above method is very inefficient: it calls addingPercentEncoding(withAllowedCharacters:) once for every non-ASCII character, allocating a new single-character String each time. It also doesn't differentiate between the components of a URL, applying CharacterSet.urlPathAllowed to the entire string. I haven't been able to spot any behavioural differences between .urlHostAllowed, .urlPathAllowed, and the rest. They don't have proper documentation, and I haven't been able to trace each character set's exact definition in the open-source implementation.
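A single pass over the string's UTF-8 bytes avoids the repeated allocations, at the cost of hand-rolling the escaping (note this still treats the whole string uniformly, so it shares the component-blindness of the loop above):

```swift
import Foundation

// Single-pass sketch: copy ASCII bytes through unchanged, percent-encode
// everything else. Builds one output string instead of calling
// addingPercentEncoding(withAllowedCharacters:) per character.
func percentEncodingNonASCII(_ string: String) -> String {
    var output = ""
    output.reserveCapacity(string.utf8.count)
    for byte in string.utf8 {
        if byte < 0x80 {
            output.append(Character(UnicodeScalar(byte)))
        } else {
            output.append(String(format: "%%%02X", byte))
        }
    }
    return output
}

print(percentEncodingNonASCII("http://example.com/引き割り.html"))
// http://example.com/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html
```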
When the URL (URI) is internationalised, URLComponents faces the same problem as URL:
let internationalisedURLComponents =
URLComponents(string: "http://example.com/引き割り.html")!
// Fatal error: Unexpectedly found nil while unwrapping an Optional value
Percent-encoding the string before handing it to URLComponents works, but only if none of the components is already partially encoded.
Here is a real-world URL I came across today that's partially encoded, and accepted by browsers:
var internationalisedURLComponents = URLComponents()
internationalisedURLComponents.scheme = "http"
internationalisedURLComponents.host = "spacedock.info"
internationalisedURLComponents.path =
    "mod/1016/Tradução%20para%20Português%20Brasileiro"
print(internationalisedURLComponents.percentEncodedPath)
// mod/1016/Tradu%C3%A7%C3%A3o%2520para%2520Portugu%C3%AAs%2520Brasileiro
// but it should be
// mod/1016/Tradu%C3%A7%C3%A3o%20para%20Portugu%C3%AAs%20Brasileiro
If any part of a URL is partially encoded, percentEncodedPath will double-encode it.
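One workaround, if you can assume any %XX sequences in the input are escapes rather than literal text, is to strip the existing encoding first and let URLComponents encode exactly once:

```swift
import Foundation

// Hedged workaround: decode any existing escapes, then let URLComponents
// re-encode the whole path once. This misbehaves if the original path
// legitimately contained a literal "%XX" sequence.
let rawPath = "mod/1016/Tradução%20para%20Português%20Brasileiro"
var components = URLComponents()
components.scheme = "http"
components.host = "spacedock.info"
components.path = rawPath.removingPercentEncoding ?? rawPath
print(components.percentEncodedPath)
// mod/1016/Tradu%C3%A7%C3%A3o%20para%20Portugu%C3%AAs%20Brasileiro
```

The "decode everything, then encode once" trick is lossy in the general case, which is another argument for a parser that understands partially encoded input natively.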
I'm not sure if unescaped URLs are invalid, but browsers support them.
The bigger issue, though, is that it's often not up to the user whether a URL arrives escaped. And if those unescaped URLs come through some serialised data, it becomes impossible to take advantage of URL's and URLComponents' automatic Codable conformance.
If browser compatibility is your goal, no Swift APIs will work. Browser URL parsers are complex, a combination of standard parsing and heuristics designed for human inputs. You’re right, Swift needs its own URI type that can go beyond the Foundation types. There may be existing types that get you farther down this path. Unfortunately Apple has been reluctant to offer official replacements or competitors to Foundation APIs so far, so a third party library is your best bet.
I tried to find WebKit's URL or URI type but couldn't, as I'm not familiar with the codebase. I'm not sure there's a fully featured Swift alternative. But I know WebKit's parser combines standard parsing with compatibility logic for browser use.
Sure. I haven't had nearly as much time to work on it as I would have liked, but I'm back at it now. You can check out my progress at GitHub - karwa/base at url (in "Sources/URL"), although I will warn that it is pretty rough. I'm mostly concerned with getting the algorithm correct, then I'll worry about cleaning up the various utility functions I added along the way.
The main things I still have to do are:
Finishing host parsing: IPv4 and v6 addresses are done, and match libc's inet_aton and inet_pton. I spent a little bit more time on them since it's a nicely encapsulated set of functionality. Only "opaque host" and domain parsing (including IDN/punycode stuff) remains.
Serialisation
Figuring out the API (including the name - I've called it XURL as a placeholder; I don't think it would be wise to have a second type called URL with different behaviour, and we all know that the 'X' makes it cool)
Lots and lots and lots of tests
Optimising the layout. Basically the only way to tell if a URL is valid is to parse it and see if it fails, which could result in loads of heap allocations depending on the lengths of the Strings. I'd like to see if it's possible to use "shared strings" to have them share a single allocation, and maybe have some kind of in-line storage for paths/query strings without many components.
The main parsing algorithm is essentially complete, save for a couple of clearly-marked TODOs (e.g. query parameters).
In principle, sure - although as I said, I’m still actively working on it. To use a cooking analogy, somebody washing up the knives and pans as I use them is likely to just get in the way, but if somebody wants to prepare the meat while I’m chopping vegetables, that’s a welcome help. So little things like tidying up utility functions I already plan to replace isn’t much help, but working on something bigger like serialisation or query parameters would be nice.
As it happens, one of the goals of the spec is actually to standardise on the term “URL”. So it’s probably worth keeping that name somewhere, but I don’t have a strong opinion on it.