WebURL now supports Unicode domain names (IDNA)

Hi!

I'm happy to announce that I just merged support for Unicode domain names (IDNA) in WebURL.
It's available right now, on the main branch, and it's a big milestone for a lot of reasons.

Note: IDNA support is on main. There will be a release in the ~next few weeks which includes it, but for now you should use a branch-based dependency for IDNA.

With this post, I'd like to discuss what IDNA is, why you should support it, how WebURL supports it, and why I think that's such a big deal.

What is IDNA?

Firstly, IDNA stands for "Internationalizing Domain Names for Applications". It is defined by Unicode Technical Standard 46, and this is how they describe it:

One of the great strengths of domain names is universality. The URL http://Apple.com goes to Apple's website from anywhere in the world, using any browser. [...]

Initially, domain names were restricted to ASCII characters. This was a significant burden on people using other characters. Suppose, for example, that the domain name system had been invented by Greeks, and one could only use Greek characters in URLs. Rather than apple.com, one would have to write something like αππλε.κομ. An English speaker would not only have to be acquainted with Greek characters, but would also have to pick those Greek letters that would correspond to the desired English letters. One would have to guess at the spelling of particular words, because there are not exact matches between scripts.

Most of the world’s population faced this situation until recently, because their languages use non-ASCII characters. A system was introduced in 2003 for internationalized domain names (IDN). This system is called Internationalizing Domain Names for Applications. [...]

In a nutshell, it's Unicode - in domain names. And it's good for the reasons Unicode is good. It means you can have URLs like:

Each top-level domain (TLD) sets its own limits on which characters may be registered - for example, JPRS, responsible for Japan's .jp country-code TLD, have decided that only ASCII, Kanji, Hiragana, and Katakana symbols may be used in a .jp domain. Many TLDs restrict or forbid Emoji in domain names, but some are less proscriptive, such as the .fm and .ws TLDs:

More information: Wikipedia - Emoji Domain

So no Emoji URLs for me? :pleading_face:

Fret not! IDNA applies to the entire domain - including subdomains! So you can totally have emoji subdomains on regular, ASCII domains you already own. And they will work in URLs.

Sadly, none of these point anywhere, but they're all technically valid! Alternatively, you could add localized variants of existing subdomains, such as support domains:

How does WebURL support IDNA?

Now that WebURL supports IDNA, all of the above URL strings can be successfully parsed by the WebURL.init?(String) initializer, and the .hostname property setter accepts Unicode domains. Previously, these operations would fail.

let url = WebURL("https://국립중앙도서관.한국/")!  // ✅ works

var url = WebURL("https://example.com/")!
url.hostname = "日本語.jp"  // ✅ works

Now, URLs do not really support Unicode - it turns out to be quite important that URL strings are always plain ASCII, and DNS is far more restrictive even than that. So how does it work now that URL components can have Unicode contents?

Similarly to the way WebURL normalizes other URL components (such as by collapsing .. segments in paths or adding percent-encoding), Unicode domains are normalized using the ToASCII algorithm defined by UTS46, with parameters defined by the URL Standard. The algorithm normalizes, case-folds, and applies compatibility mappings to the domain, before encoding it as ASCII using an encoding format known as Punycode.

All of that happens automatically when you perform either of the above operations. After parsing a URL or setting its hostname, you will find that it has been converted to ASCII.

| Unicode | ASCII |
| --- | --- |
| http://招商银行.中国 | http://xn--czrx92avj3aruk.xn--fiqs8s/ |
| https://국립중앙도서관.한국/ | https://xn--zb0b2h01ozygv9j7lgn8g.xn--3e0b707e/ |
| https://we❤️swift.fm | https://xn--weswift-z98d.fm/ |
| https://🛍.example.com/ | https://xn--878h.example.com/ |

let url = WebURL("https://招商银行.中国/")!
print(url.hostname)  // "xn--czrx92avj3aruk.xn--fiqs8s"
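
To give a feel for the last step of that pipeline, here is a minimal sketch of Punycode encoding for a single label, following the pseudocode in RFC 3492. It is not WebURL's implementation, the name `punycodeEncode` is made up for this example, and it deliberately skips everything UTS46 does beforehand (mapping, normalization, validation). It only covers the "turn Unicode scalars into ASCII" part.

```swift
/// Punycode-encodes one label, per the pseudocode in RFC 3492 section 6.3.
/// Hypothetical helper for illustration; returns nil on arithmetic overflow.
func punycodeEncode(_ input: String) -> String? {
    let base = 36, tMin = 1, tMax = 26, skew = 38, damp = 700
    let scalars = input.unicodeScalars.map { Int($0.value) }

    // Maps a digit value in 0..<36 to "a"..."z" then "0"..."9".
    func digit(_ d: Int) -> Character {
        Character(UnicodeScalar(UInt8(d < 26 ? d + 97 : d - 26 + 48)))
    }

    // Bias adaptation (RFC 3492 section 6.1).
    func adapt(_ delta: Int, _ numPoints: Int, _ first: Bool) -> Int {
        var delta = first ? delta / damp : delta / 2
        delta += delta / numPoints
        var k = 0
        while delta > ((base - tMin) * tMax) / 2 {
            delta /= base - tMin
            k += base
        }
        return k + (base - tMin + 1) * delta / (delta + skew)
    }

    // ASCII scalars are copied to the output first, followed by a delimiter.
    var output = ""
    output.unicodeScalars.append(contentsOf: input.unicodeScalars.filter { $0.isASCII })
    let basicCount = output.unicodeScalars.count
    var handled = basicCount
    if basicCount > 0 { output.append("-") }

    var n = 128, delta = 0, bias = 72
    while handled < scalars.count {
        // The smallest code point that has not been encoded yet.
        guard let m = scalars.filter({ $0 >= n }).min() else { return nil }
        let (product, overflowed1) = (m - n).multipliedReportingOverflow(by: handled + 1)
        let (sum, overflowed2) = delta.addingReportingOverflow(product)
        if overflowed1 || overflowed2 { return nil }
        delta = sum
        n = m
        for c in scalars {
            if c < n { delta += 1 }
            if c == n {
                // Write delta as a variable-length, base-36 integer.
                var q = delta
                var k = base
                while true {
                    let t = k <= bias ? tMin : (k >= bias + tMax ? tMax : k - bias)
                    if q < t { break }
                    output.append(digit(t + (q - t) % (base - t)))
                    q = (q - t) / (base - t)
                    k += base
                }
                output.append(digit(q))
                bias = adapt(delta, handled + 1, handled == basicCount)
                delta = 0
                handled += 1
            }
        }
        delta += 1
        n += 1
    }
    return output
}

// Encoded labels are marked with the "xn--" prefix.
if let encoded = punycodeEncode("日本語") {
    print("xn--" + encoded)  // prints "xn--wgv71a119e"
}
```

Labels which are already ASCII never go through this step at all, which is why only the Unicode parts of a domain end up looking unusual.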

There are a couple of interesting things to point out:

  • Unicode text is normalized and case-folded before Punycode.

    We all know that Unicode is complex. The strings caf\u{00E9}.fr and cafe\u{0301}.fr do not contain the same bytes, nor do they even contain the same scalars, but something called "Unicode Canonical Equivalence" says we should treat these as the same string.

    So how does that apply to domains? Do routers and caches need to check canonical equivalence to tell if two domains are the same? Does SSL certificate validation depend on Unicode canonical equivalence?

    No. With IDNA, canonically-equivalent strings (or strings which differ only in case) produce the same ASCII result. It doesn't matter whether the hostname is caf\u{00E9}.fr, cafe\u{0301}.fr, or CAFE\u{0301}.fr - they all return http://xn--caf-dma.fr/. All of your expectations about how URLs and ASCII strings work are maintained, and you can more-or-less forget that it represents Unicode contents.

  • IDNA is applied per-label.

    Notice how 🛍.example.com became xn--878h.example.com in our subdomain example? IDNA applies to each segment of a domain individually (known as a "label"), so any code or routing rules which expect to see *.example.com will continue to see that.

    The theme here is compatibility - essentially, the Unicode portions (and only those portions) end up as funny-looking ASCII segments, and everything "just works".

  • URL and URLSession.

    Because IDNA is designed to support legacy systems, the Unicode -> ASCII conversion is all handled by WebURL. And since WebURL has excellent interoperability with Foundation, you can now make requests to Unicode domains using URLSession and other Foundation APIs.

    let page = try! String(contentsOf: WebURL("http://招商银行.中国")!)  // ✅ Works
    
    let (data, _) = try await URLSession.shared.data(from: WebURL("https://日本語.jp")!) // ✅
    

    The same applies to our WebURL-native port of async-http-client, which performs true, web-compatible URL processing throughout the entire request process. It now also supports IDNA.

  • And it back deploys.

    WebURL supports all Apple platforms, Linux, and Windows. It is a pure package implementation, so you can guarantee IDNA support for users no matter which OS they're running. It only requires a Swift 5.3+ compiler to build, so anybody should be able to use it.
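
The normalization and per-label points above can be illustrated with nothing but the standard library and Foundation. This is a rough sketch, not what WebURL does internally: `simplifiedLabels` is a made-up helper, and lowercasing followed by NFC is only a simplified stand-in for the real UTS46 mapping step (which uses case-folding and compatibility mappings).

```swift
import Foundation

let precomposed = "caf\u{00E9}.fr"   // "é" as a single scalar
let decomposed  = "cafe\u{0301}.fr"  // "e" followed by a combining acute accent

// The scalars differ...
print(precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars))  // prints "false"
// ...but Swift's String comparison respects canonical equivalence.
print(precomposed == decomposed)  // prints "true"

/// Hypothetical helper: splits a domain into labels and applies a simplified
/// mapping (lowercase, then NFC) to each label individually.
func simplifiedLabels(_ domain: String) -> [String] {
    domain.split(separator: ".", omittingEmptySubsequences: false).map {
        $0.lowercased().precomposedStringWithCanonicalMapping
    }
}

// After mapping, all three spellings have identical scalars in every label.
let variants = ["caf\u{00E9}.fr", "cafe\u{0301}.fr", "CAFE\u{0301}.fr"]
let mapped = variants.map { simplifiedLabels($0).map { Array($0.unicodeScalars) } }
print(mapped[0] == mapped[1] && mapped[1] == mapped[2])  // prints "true"

// ASCII labels pass through untouched; only the Unicode label would need encoding.
print(simplifiedLabels("🛍.example.com"))  // prints ["🛍", "example", "com"]
```

Once every spelling maps to the same scalars, encoding each label to ASCII gives a single result, so downstream systems only ever need to compare plain ASCII strings.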

Why is it such a big deal for WebURL to support IDNA?

IDNA has had a bit of a rough launch. The first IDN TLDs were approved in 2010, and since then progress has been... a bit mixed:

It's a little underwhelming, but there are some technical issues which may shed light on why that is. I'd like to draw your attention to this infographic by IDN World Report (a joint research project from the EU, Coordination Center, and UNESCO):

And specifically to this part:

Currently, software support for IDNA is poor. As is tradition for anything URL-related, there are multiple, incompatible revisions of the standard, with some domains producing different results depending on which revision you use, or being valid in one but not the other.

Yeah. This again. Uhhhhhh :weary:

We see these differences in browsers - German speakers can't reliably use 'ß' in domains (Chrome turns it into 'ss' but Safari turns it into 'xn--zca'), and URLs such as https://www.👨‍🦰.tk/ only work in Chrome (but should be considered invalid). Currently, Safari is fully compliant with the latest version of the standard, followed by Firefox, while Chrome (the browser with by far the largest market share worldwide) is the least compliant. That's why the major browser developers have decided to make alignment on IDNA an Interop-22 priority item for the web.

But take another look at the devices in that picture. Seem familiar?

Hey - wait a second! Servers? Smartphones? That's our house!!!

It even has a notch!

So yeah - besides browsers, smartphones have, of course, been the primary driver of increased internet connectivity across the entire world over the last decade while IDNA was seeking greater adoption - and yet, core system frameworks such as Foundation on iOS have lacked support. That should underscore just how poor the IDNA experience has been across the industry so far. Browsers may or may not work (can you expect that somebody will be able to open an IDN URL from an email? Who knows? :man_shrugging: Probably not), and if you used an iPhone or iPad, you could basically guarantee that nothing would support IDNA unless the developer went to an exceptional amount of effort to specifically support it. And the same is true for Swift on Server.

When you consider all of this, I think it's clear that IDNA hasn't been given a fair shake. Maybe it'll catch on, or maybe it won't - but that should be decided by its ability to make life better for the people it is designed to include, not by unreliable software killing the project with a lack of interoperability.

And now?

It was recently announced that Foundation.URL will gain support for IDNA with the next major OS release, expected later this year, which is fantastic news.

WebURL offers something extra - not only the latest URL standard, but now also backwards deployable support for IDNA, building on all the work we've done for interoperability with existing Foundation.URL code-bases. And because it's actually a Swift-native implementation, we can offer more advanced APIs for analyzing and rendering Unicode domains.

And it's available today. Right now.

So that's why I think this is a big deal. With package-based IDNA support that plugs in to existing applications, there is no longer any reason for Swift applications to not support IDNA.

At the start of this year, IDNA support across the industry was pretty thin. But now, with a renewed focus from browser vendors to align their implementations to the latest standard, and with IDNA support coming in Apple's system libraries, and backwards-deployable with WebURL, IDNA's future is looking much more promising.

I encourage you all to check whether your applications can support IDNA, and it's yet another reason to try out WebURL in your apps.

:wave:

Also, this is a big milestone for another reason - it means WebURL now fully conforms to the URL standard. No gaps. :partying_face:

It has been a really long journey to get this far. A lot of this stuff involves things you only really learn by doing - generally, as a user of a library, you don't care how most of these things work (and you shouldn't have to!).

I've had to learn a lot - about the weird legacy quirks in URLs, and how, throughout their history, there has constantly been a mismatch between how developers expect URLs to behave and how they actually do. That has led to a catalogue of security problems - security blunders, really - that could have been avoided. And I have had to try (and am still trying) to distill all of that, and apply it, to create a "next-gen" Swift API that is actually easy to use correctly; something that takes URLs to a new level of expressivity, gives us new capabilities, and interoperates with other machines based on modern standards.

It's a really unique sort of thing. It might not seem it at first, but trying to organise all of these decades of cruft and chaos in to an elegant, modern library is a very worthy challenge. It has been really rewarding, and there is still more to come.

So, thank you very much to everybody who has used or is using WebURL, or has supported the project at all. I've never earned anything from it, but I'm motivated to continue for moments like this - I know that a lot of people have wanted this feature for a long time, and I'm thrilled that WebURL can finally deliver it.

Very impressive Karl, congrats.

great! The README on GitHub (GitHub - karwa/swift-url: A new URL type for Swift) should be updated. It currently says: " It currently does not support Internationalized Domain Names (IDNA), but that support is planned."

Yes, it isn't part of an actual release yet. I'm going to update that (and there are various places in the DocC documentation which also need updating).

The reason I want to let people know a little bit in advance of that is because it is a new implementation, and even though it passes the Unicode Consortium's test suite, it's helpful if early adopters can try it out and report any issues before it goes to everyone.


Also, there's one more WebURL-related thing that people might find cool, but I haven't had an opportunity to talk about before: Rendering.

Notice how we're able to style each component separately, including internal structure like path components, query parameters, and their internal delimiters. This is built using public APIs (specifically, the UTF8View - I keep saying it's super cool and hopefully this shows why :slight_smile:), although the query parameters from this picture are using a new API that's still WIP. You can elide components (as the 'mono' style is doing, to omit the scheme), and I'd like to come up with some good heuristics for when that is appropriate.

You can find the implementation here; it produces an attributed string, styled by a stylesheet. The two styles above are implemented here. Try it out! Let me know what you come up with!

Currently, my thinking for IDN rendering is that we'd have a Domain type, including APIs for rendering a domain as Unicode; and for full-URL rendering, we'd do something along these lines. Renderers would visit the URL components and produce some kind of value - probably typically just a regular String, but fancier things like attributed strings or even HTML should be possible.

It would be the renderer's job to decide whether or not to decode an IDN. Part of their decision may be to decode the IDN, but display it in a particular colour, or with a particular font or character spacing, etc. That kind of detailed presentation API would give us far and away the best IDN support in any URL library, IMO.
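
To make the idea a bit more concrete, here is a very rough sketch of what a visiting renderer might look like. None of these names exist in WebURL: `URLPart`, `URLRenderer`, and `PlainRenderer` are all invented for this example, and the decoded Unicode form of the hostname is simply supplied by hand rather than computed.

```swift
/// Hypothetical model of the pieces a renderer would visit (not a WebURL type).
enum URLPart {
    case scheme(String)
    /// `unicode` is the decoded IDN form, if the hostname has one.
    case hostname(ascii: String, unicode: String?)
    case path(String)
}

/// Hypothetical renderer interface: visits the parts and produces some output,
/// typically a String, but it could be an attributed string or HTML.
protocol URLRenderer {
    associatedtype Output
    func render(_ parts: [URLPart]) -> Output
}

/// A plain-text renderer which decides for itself whether to decode the IDN
/// and whether to elide the scheme.
struct PlainRenderer: URLRenderer {
    var decodeIDN: Bool
    var showScheme: Bool

    func render(_ parts: [URLPart]) -> String {
        var result = ""
        for part in parts {
            switch part {
            case .scheme(let scheme):
                if showScheme { result += scheme + "://" }
            case .hostname(let ascii, let unicode):
                // Fall back to the ASCII form if there is nothing to decode.
                result += decodeIDN ? (unicode ?? ascii) : ascii
            case .path(let path):
                result += path
            }
        }
        return result
    }
}

let parts: [URLPart] = [
    .scheme("https"),
    .hostname(ascii: "xn--wgv71a119e.jp", unicode: "日本語.jp"),
    .path("/"),
]
print(PlainRenderer(decodeIDN: true, showScheme: false).render(parts))   // prints "日本語.jp/"
print(PlainRenderer(decodeIDN: false, showScheme: true).render(parts))   // prints "https://xn--wgv71a119e.jp/"
```

A fancier renderer could use the same visiting structure but return styled output, for example colouring a decoded IDN differently from a plain ASCII hostname.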

If you have to show URLs in your application, why not make them look a bit nicer? And the use of text styles can help to make the meaning clearer (e.g. highlighting the hostname section, or highlighting "https" in green, etc).