IDN/Punycode in URL

It looks like URL() does not support IDN (internationalized domain names such as www.bücher.de or www.académie-française.fr).

Is there any standard method available to transform these IDNs to Punycode so they can be used with URL()?

I know there are apparently third-party libraries that do this (Punycode on CocoaPods.org).
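
For reference, here's a minimal snippet showing the behaviour in question; the exact results may differ by platform and Foundation version, and the Punycode form is hand-converted:

import Foundation

// The raw internationalised host is rejected outright...
let idn = URL(string: "http://www.bücher.de/")
//  idn: URL? = nil (on the Foundation versions I've tried)

// ...while the Punycode (ASCII Compatible Encoding) form parses fine.
let punycoded = URL(string: "http://www.xn--bcher-kva.de/")
//  punycoded?.host == "www.xn--bcher-kva.de"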

2 Likes

I'm currently working on a pure-Swift URL parser based on the WHATWG spec, which should handle this and a bunch of other Foundation URL oddities and inefficiencies. It's too early to show yet, and I haven't attempted any optimisation of the prescribed parsing algorithm. I may have something to show in 7-10 days, depending on how much time I get to work on it. The standard library is missing some features that would really help, like being able to mutate a String's internal code units (and optionally bypass String's internal UTF-8 validation while doing so, since the parser already handles non-ASCII code points), but discovering those limitations is part of why I'm doing this.

Ultimately my goal is to hook it up to NIO to provide a truly open, cross-platform, non-blocking filesystem and network interface.

Sorry if that's not too helpful; it was just really interesting to me that you mentioned this while I'm working on the same thing.

7 Likes

I want to echo this, and express my desire to see URL natively handle internationalised URIs.

I discovered this problem yesterday, and considered writing an evolution pitch about it, but then remembered that the evolution process doesn't really cover Foundation APIs.

What makes this problem so difficult for me is that URL already conforms to Codable. I can percent-encode the strings myself before passing them into URL.init(string:). However, to make that percent-encoding step work with Codable, I must either create a custom type wrapping Foundation's URL, or write custom boilerplate for init(from decoder:) and encode(to encoder:). If I choose the custom-type route, I need to recreate or forward every URL property I use, and still have to convert the custom type back to URL before passing it to any function that takes URL parameters. If I choose the custom Codable conformance route, the containing type loses its synthesised conformance, so every other property in the same type has to be decoded and encoded by hand as well. Either way, the boilerplate grows repetitive quickly.
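
To illustrate the boilerplate, this is roughly what the custom-conformance route looks like. Bookmark is a made-up model, and the single .urlQueryAllowed pass is only a stand-in for whatever per-component encoding a real implementation would need:

import Foundation

struct Bookmark: Codable {
    var title: String
    var link: URL

    private enum CodingKeys: String, CodingKey {
        case title, link
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        title = try container.decode(String.self, forKey: .title)

        // Decode the URL as a plain string and percent-encode it ourselves
        // before handing it to URL(string:).
        let raw = try container.decode(String.self, forKey: .link)
        guard let escaped = raw.addingPercentEncoding(withAllowedCharacters: .urlQueryAllowed),
              let url = URL(string: escaped) else {
            throw DecodingError.dataCorruptedError(
                forKey: .link, in: container,
                debugDescription: "Invalid URL string: \(raw)")
        }
        link = url
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encode(title, forKey: .title)
        try container.encode(link.absoluteString, forKey: .link)
    }
}

And this has to be repeated for every type that stores a URL, which is exactly the repetitive boilerplate I mean.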

Not in the standard library, but NSString has addingPercentEncoding(withAllowedCharacters:). StringProtocol (which is in the standard library) has the same method too, although I haven't been able to locate it in the standard library source; presumably it's declared in a Foundation extension rather than in the standard library itself.

It's a lot of work to use it properly. For example, if you have a URL like http://example.com/引き割り.html, this is what happens if you encode the entire thing:

let internationalisedURLString = "http://example.com/引き割り.html"

let percentEncodedURLString = internationalisedURLString
    .addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)!
//  percentEncodedURLString: String = 
//  "http%3A//example.com/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html"

let percentEncodedURL = URL(string: percentEncodedURLString)

print(percentEncodedURL!.path)
//  http://example.com/引き割り.html

print(percentEncodedURL!.absoluteURL)
//  http%3A//example.com/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html

Notice how the entire URL string was encoded as if it contained only a path component? Since addingPercentEncoding(withAllowedCharacters:) is an NSString/StringProtocol method, it's not really its job to recognise the string's shape and give URL strings special treatment.

One way to handle this is to parse the URL structure yourself, then encode each component with the corresponding character set (.urlHostAllowed, .urlPathAllowed, .urlQueryAllowed, etc.).
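
For example, a rough sketch of that approach with the component boundaries hard-coded for brevity (a real implementation would have to parse the string to find them, and note that percent-encoding a non-ASCII host is not the same thing as the Punycode conversion this thread started with):

import Foundation

let scheme = "http"
let host = "example.com"
let path = "/引き割り.html"

// Encode each piece with the character set intended for that component.
let encodedHost = host.addingPercentEncoding(withAllowedCharacters: .urlHostAllowed)!
let encodedPath = path.addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)!

let url = URL(string: "\(scheme)://\(encodedHost)\(encodedPath)")

print(url!.absoluteString)
//  http://example.com/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html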

Another way is looping through the string and encoding each character individually:

let internationalisedURLString = "http://example.com/引き割り.html"

var percentEncodedURLString = ""

internationalisedURLString.forEach { character in
    if character.isASCII {
        percentEncodedURLString.append(character)
    } else {
        let percentEncodedCharacter = String(character)
            .addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)!
        percentEncodedURLString.append(percentEncodedCharacter)
    }
}
//  percentEncodedURLString: String = 
//  "http://example.com/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html"

let percentEncodedURL = URL(string: percentEncodedURLString)

print(percentEncodedURL!.path)
//  /引き割り.html

print(percentEncodedURL!.absoluteURL)
//  http://example.com/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html

The above method is quite inefficient, since it calls addingPercentEncoding(withAllowedCharacters:) once for every non-ASCII character and repeatedly appends to the result string. It also doesn't differentiate between the components of a URL, and uses CharacterSet.urlPathAllowed for every character in the string. I haven't been able to spot the differences between .urlHostAllowed, .urlPathAllowed, and the rest; they don't have proper documentation, and I haven't been able to trace each character set's exact definition in the open-source implementation.
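
A single-pass alternative for one component looks something like this; percentEncode is a made-up helper, and it assumes you've already isolated the path:

import Foundation

// Walk the UTF-8 bytes once, copying allowed ASCII bytes through and
// escaping everything else as %XX.
func percentEncode(_ string: String, allowed: CharacterSet) -> String {
    var result = ""
    result.reserveCapacity(string.utf8.count)
    for byte in string.utf8 {
        let scalar = Unicode.Scalar(byte)
        if byte < 0x80, allowed.contains(scalar) {
            result.append(Character(scalar))
        } else {
            result += String(format: "%%%02X", Int(byte))
        }
    }
    return result
}

let encodedPath = percentEncode("/引き割り.html", allowed: .urlPathAllowed)
//  encodedPath: String = "/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html"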

For proper escaping it’s better to use URLComponents, as it knows how to differentiate between the parts of a URL and apply the required escaping to each.

When the URL (URI) is internationalised, URLComponents faces the same problem as URL:

let internationalisedURLComponents = 
    URLComponents(string: "http://example.com/引き割り.html")!
//  Fatal error: Unexpectedly found nil while unwrapping an Optional value

It can't parse internationalised strings.

Yes, but assigning the components individually should yield a valid, escaped URL.

And technically, I don’t think that’s a valid URL since the path isn’t escaped, so that’s what I’d expect.

Do you mean something like this then:

var internationalisedURLComponents = URLComponents()

internationalisedURLComponents.scheme = "http"
internationalisedURLComponents.host = "example.com"
internationalisedURLComponents.path = "引き割り.html"

print(internationalisedURLComponents)
//  scheme: http host: example.com path: 引き割り.html

print(internationalisedURLComponents.percentEncodedPath)
//  %E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html

This works, but only if none of the components are already partially encoded.

Here is a real-world URL I came across today that's partially encoded, and accepted by browsers:

var internationalisedURLComponents = URLComponents()

internationalisedURLComponents.scheme = "http"
internationalisedURLComponents.host = "spacedock.info"
internationalisedURLComponents.path =
    "mod/1016/Tradução%20para%20Português%20Brasileiro"

print(internationalisedURLComponents.percentEncodedPath)
//  mod/1016/Tradu%C3%A7%C3%A3o%2520para%2520Portugu%C3%AAs%2520Brasileiro

//  but it should be
//  mod/1016/Tradu%C3%A7%C3%A3o%20para%20Portugu%C3%AAs%20Brasileiro

If any part of a URL is already partially encoded, percentEncodedPath will double-encode it.
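
One workaround I can think of is to strip the existing escapes first and let URLComponents apply a single round of encoding; it assumes the already-encoded parts don't deliberately contain literal % sequences:

import Foundation

var components = URLComponents()
components.scheme = "http"
components.host = "spacedock.info"

let partiallyEncodedPath = "mod/1016/Tradução%20para%20Português%20Brasileiro"

// Undo whatever escapes are already present, then let the path setter
// re-encode the whole thing exactly once.
components.path = partiallyEncodedPath.removingPercentEncoding ?? partiallyEncodedPath

print(components.percentEncodedPath)
//  mod/1016/Tradu%C3%A7%C3%A3o%20para%20Portugu%C3%AAs%20Brasileiro

It's lossy if a literal %20 was actually intended, though, which is part of why I'd rather the URL type handled this itself.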

I'm not sure if unescaped URLs are invalid, but browsers support them.

The bigger issue, though, is that it's often not up to the user whether a URL is escaped. And if those unescaped URLs come through some serialised data, it becomes impossible to take advantage of URL and URLComponents' automatic Codable conformance.

If browser compatibility is your goal, no Swift APIs will work. Browser URL parsers are complex: a combination of standard parsing and heuristics designed for human input. You’re right that Swift needs its own URI type that can go beyond the Foundation types. There may be existing types that get you farther down this path. Unfortunately, Apple has so far been reluctant to offer official replacements for, or competitors to, Foundation APIs, so a third-party library is your best bet.

Browser compatibility would be nice to have.

This sounds interesting. Do you know some good resources where I can learn more about it?

Personally, I'm most interested in the automatic Codable conformance when parsing unescaped URLs.

I tried to find WebKit's URL or URI type but couldn't, as I'm not familiar with the codebase. I'm not sure there's a fully featured Swift alternative, but I do know that WebKit's parser combines standard parsing with compatibility logic for browser use.

How is the parser development going, if you don't mind me asking?

1 Like

Sure. I haven't had nearly as much time to work on it as I would have liked, but I'm back at it now. You can check out my progress at GitHub - karwa/base at url (in "Sources/URL"), although I will warn that it is pretty rough. I'm mostly concerned with getting the algorithm correct, then I'll worry about cleaning up the various utility functions I added along the way.

The main things I still have to do are:

  • Finishing host parsing: IPv4 and v6 addresses are done and match libc's inet_aton and inet_pton. I spent a little more time on them since they're a nicely encapsulated set of functionality. Only "opaque host" and domain parsing (including the IDN/Punycode handling) remain.
  • Serialisation
  • Figuring out the API (including the name - I've called it XURL as a placeholder; I don't think it would be wise to have a second type called URL with different behaviour, and we all know that the 'X' makes it cool)
  • Lots and lots and lots of tests
  • Optimising the layout. Basically the only way to tell if a URL is valid is to parse it and see if it fails, which could result in loads of heap allocations depending on the lengths of the Strings. I'd like to see if it's possible to use "shared strings" to have them share a single allocation, and maybe have some kind of in-line storage for paths/query strings without many components.

The main parsing algorithm is essentially complete, save for a couple of clearly-marked TODOs (e.g. query parameters).

1 Like

Would you accept pull requests?

What do you think of URI? I've entertained myself with the idea of a URI type that supersedes URL and parses URNs as well.

In principle, sure - although as I said, I’m still actively working on it. To use a cooking analogy, somebody washing up the knives and pans as I use them is likely to just get in the way, but if somebody wants to prepare the meat while I’m chopping vegetables, that’s a welcome help. So little things like tidying up utility functions I already plan to replace aren’t much help, but working on something bigger like serialisation or query parameters would be nice.

As it happens, one of the goals of the spec is actually to standardise on the term “URL”. So it’s probably worth keeping that name somewhere, but I don’t have a strong opinion on it.

1 Like