WebURL 0.4.0 released! Now with IDN support

0.4.0

This is a big one!

  • WebURL now supports Internationalised Domain Names (IDNs).
  • The URL host parser is now exposed as API, so you can parse hostnames like URLs do.
  • There is a new Domain type, which supports rich processing of domains/IDNs.

IDN support was the missing piece. Now that it is done, we can say:

:tada: WebURL fully conforms to the WHATWG URL Standard :tada:

[GitHub]


:globe_with_meridians: Internationalised Domain Names

WebURL now supports Internationalised Domain Names (IDNs):

import WebURL

WebURL("http://中国移动.中国")
// ✅ "http://xn--fiq02ib9d179b.xn--fiqs8s/"

WebURL("https://🛍.example.com/")
// ✅ "https://xn--878h.example.com/"

This may look strange if you are unfamiliar with IDNs. In order to be compatible with existing internet infrastructure, Unicode text in domains needs special compatibility processing, resulting in an encoded string with the distinctive "xn--" prefix. This processing is called IDNA. If somebody wants to register the domain "中国移动.中国", they instead register "xn--fiq02ib9d179b.xn--fiqs8s", and behind the scenes, everything works just like it always did with plain, non-Unicode domains -- importantly, we don't need internet routing infrastructure or applications to process hostnames differently to how they normally would. This encoded version is not very helpful to humans, but browsers and applications can detect these domains and present them in Unicode (we have APIs for that; more info below).

For more information about IDNs see IDN World Report.
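To make the "xn--" encoding less mysterious, here is a minimal sketch of the Punycode step (RFC 3492) in plain Swift. It is not WebURL's implementation, and it is not full IDNA: the real processing also maps, normalises and validates the text before encoding, and guards against overflow. The names `punycodeEncode` and `toASCIILabel` are made up for this illustration.

```swift
// Punycode parameters from RFC 3492, section 5.
let base = 36, tMin = 1, tMax = 26, skew = 38, damp = 700
let initialBias = 72, initialN = 128

/// Maps a digit value in 0..<36 to its ASCII character: a-z, then 0-9.
func digitCharacter(_ digit: Int) -> Character {
    let code = digit < 26 ? 97 + digit : 22 + digit
    return Character(UnicodeScalar(UInt8(code)))
}

/// Bias adaptation function from RFC 3492, section 6.1.
func adaptBias(delta: Int, pointCount: Int, isFirst: Bool) -> Int {
    var delta = isFirst ? delta / damp : delta / 2
    delta += delta / pointCount
    var k = 0
    while delta > ((base - tMin) * tMax) / 2 {
        delta /= base - tMin
        k += base
    }
    return k + (base - tMin + 1) * delta / (delta + skew)
}

/// Hypothetical helper: Punycode-encodes a single label (no "xn--" prefix).
/// Sketch only: no overflow checks, no IDNA mapping or validation.
func punycodeEncode(_ label: String) -> String {
    let scalars = label.unicodeScalars.map { Int($0.value) }
    // Copy the ASCII code points through unchanged.
    var output = ""
    for scalar in label.unicodeScalars where scalar.isASCII {
        output.unicodeScalars.append(scalar)
    }
    let basicCount = output.unicodeScalars.count
    var handled = basicCount
    if basicCount > 0 { output.append("-") }

    var n = initialN, delta = 0, bias = initialBias
    while handled < scalars.count {
        // Find the smallest code point that has not been encoded yet.
        let next = scalars.filter { $0 >= n }.min()!
        delta += (next - n) * (handled + 1)
        n = next
        for scalar in scalars {
            if scalar < n { delta += 1 }
            if scalar == n {
                // Write delta as a variable-length base-36 integer.
                var q = delta
                var k = base
                while true {
                    let t = k <= bias ? tMin : (k >= bias + tMax ? tMax : k - bias)
                    if q < t { break }
                    output.append(digitCharacter(t + (q - t) % (base - t)))
                    q = (q - t) / (base - t)
                    k += base
                }
                output.append(digitCharacter(q))
                bias = adaptBias(delta: delta, pointCount: handled + 1,
                                 isFirst: handled == basicCount)
                delta = 0
                handled += 1
            }
        }
        delta += 1
        n += 1
    }
    return output
}

/// Hypothetical helper: adds the "xn--" prefix to labels that need encoding.
func toASCIILabel(_ label: String) -> String {
    if label.unicodeScalars.allSatisfy({ $0.isASCII }) { return label }
    return "xn--" + punycodeEncode(label)
}

print(toASCIILabel("中国"))     // xn--fiqs8s
print(toASCIILabel("🛍"))       // xn--878h
print(toASCIILabel("example"))  // example
```

In practice you would never do this by hand: parsing the URL with WebURL performs the complete, standard-conforming processing for you.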

Browsers are making an increased effort this year to align their own IDNA implementations (Safari/WebKit already conforms), and it has been announced that Apple's next major operating system releases will include support in Foundation URL. WebURL now implements this part of the URL Standard as well; it is available today, and it fully backwards-deploys. It's important that URLs work consistently for everybody, and WebURL can help with that.

What's more - since this processing happens in the URL type, it works with our existing Foundation interop:

import WebURL
import Foundation
import WebURLFoundationExtras

let (data, _) = try await URLSession.shared.data(for: WebURL("http://全国温泉ガイド.jp")!)
// ✅ Works

let convertedToURL = URL(WebURL("http://全国温泉ガイド.jp")!)!
// ... continue processing 'convertedToURL' as you normally would

Developers have been asking for better IDN support across the industry for years - at this stage of adoption, most IDNs are in China, so Chinese developers in particular have been wanting to work with these kinds of URLs. I'm especially pleased that WebURL is now able to offer it to any Swift application.

:open_book: Host Parsing API

IDN support as the standard requires is great and all, but it isn't enough.

URLs are designed to be universal - infinitely customisable. There are some "special" schemes which the standard knows about, such as http:, and while their hosts have semantic meaning (they are network addresses, hence we should use IDNA, detect IPv4 addresses, etc), generally, for other schemes, the host is just an opaque string and is not interpreted.

That's the correct model, but frequently we are processing URLs which are very HTTP-like, and we would like to support the same network addresses, in the same way, as an HTTP URL. For instance, suppose we were writing an application to handle ssh: URLs - the standard would only parse IPv6 addresses out for us, and everything else would just be an opaque string.

WebURL("ssh://karl@somehost/")!.host
// 😐 .opaque, "somehost"

WebURL("ssh://karl@abc.أهلا.com/")!.host
// 😕 .opaque, "abc.%D8%A3%D9%87%D9%84%D8%A7.com"

WebURL("ssh://karl@192.168.0.1/")!.host
// 🤨 .opaque, "192.168.0.1"

Request libraries generally need to write their own parsers to handle this, but it is difficult to match the host parser for HTTP URLs exactly... unless, of course, you are the URL host parser :thinking:...

So with 0.4.0, WebURL's Host type exposes the URL host parser directly to your applications. Not only is this great for processing URLs of any scheme, it's also useful for hostnames provided via command-line interfaces or configuration files. Being able to guarantee the host is interpreted the same way as it would be in an http: URL is a very useful property, just by itself.

WebURL.Host("EXAMPLE.com", scheme: "http")
// 😍 .domain, Domain { "example.com" }

WebURL.Host("abc.أهلا.com", scheme: "http")
// 🤩 .domain, Domain { "abc.xn--igbi0gl.com" }

WebURL.Host("192.168.0.1", scheme: "http")
// 🥳 .ipv4Address, IPv4Address { 192.168.0.1 }

:duck: Domain API

Exposing the host parser is great and all, but it also isn't enough.

Previously, we only had types for IPv4 and IPv6 addresses, and domains were represented as Strings. Now, domains have their own type - WebURL.Domain, which is guaranteed to contain a validated, normalised domain from the URL host parser, and can be a useful place to house APIs which operate on domains.

WebURL.Domain("example.com")  // ✅ "example.com"
WebURL.Domain("localhost")    // ✅ "localhost"
WebURL.Domain("api.أهلا.com")  // ✅ "api.xn--igbi0gl.com"
WebURL.Domain("xn--caf-dma")  // ✅ "xn--caf-dma" ("café")

WebURL.Domain("in valid")     // ✅ nil (spaces are not allowed)
WebURL.Domain("xn--cafe-yvc") // ✅ nil (invalid IDN)
WebURL.Domain("192.168.0.1")  // ✅ nil (not a domain)

The most important API right now is render, which builds a result using an encapsulated algorithm. There is opportunity for renderers to produce any kind of result - for example, they might perform spoof-checking to guard against confusable text, or they might use a database to shorten domains to their most important section, or they might have special formatting for particular domains. You can create a renderer by conforming to the WebURL.Domain.Renderer protocol.

WebURL comes with an uncheckedUnicodeString renderer, so you can recover the Unicode form of a domain. This renderer does not perform any spoof-checking, so is not recommended for use in UI.

let domain = WebURL.Domain("xn--fiq02ib9d179b.xn--fiqs8s")!
domain.render(.uncheckedUnicodeString)
// ✅ "中国移动.中国"

And with that, I'm happy with WebURL's host story. It provides rich, detailed information about the hosts defined in the URL Standard and gives you the means to easily and robustly process them. Please try it out and leave feedback!

:gift: Bonus: Spoof-checked renderer prototype

It is important that applications use spoof checking when displaying domains in Unicode form. We have a proof-of-concept renderer which ports much of Chromium's IDN spoof-checking logic. It works on my Mac, but deploying it can be a pain because it depends on the ICU library for its implementation of UAX39.

// Non-IDNs.
WebURL.Domain("paypal.com")?.render(.checkedUnicodeString) // ✅ "paypal.com"
WebURL.Domain("apple.com")?.render(.checkedUnicodeString)  // ✅ "apple.com"

// IDNs.
WebURL.Domain("a.أهلا.com")?.render(.checkedUnicodeString)   // ✅ "a.أهلا.com"
WebURL.Domain("你好你好")?.render(.checkedUnicodeString)     // ✅ "你好你好"

// Spoofs.
WebURL.Domain("раγpal.com")?.render(.checkedUnicodeString) // ✅ "xn--pal-vxc83d5c.com"
WebURL.Domain("аpple.com")?.render(.checkedUnicodeString)  // ✅ "xn--pple-43d.com"

It would be great to turn this into a maintained, easily-deployable package. I'm too busy right now, so it remains a prototype, but maybe one day? Or if anybody else would like to get involved, they can use it as a starting point.

Bugfixes

  • Fixed a crash when appending an empty array of form params (#140). Thanks to adam-fowler for the report. Sorry it took so long to get into a release.

I've just released 0.4.1

If you are testing applications with TSan enabled, I highly recommend updating, as otherwise you may see false reports of data races or even crashes within the TSan runtime - the latter effectively making it impossible to test your application.

Here's a reduced example which crashes TSan on Xcode 14:

// tsan-test.swift
struct MYRNG: RandomNumberGenerator {
    mutating func next() -> UInt64 { return 1 }
}

func returnNext<R: RandomNumberGenerator>(using generator: inout R) -> UInt64 {
    return generator.next()
}

func callReturnNext<R: RandomNumberGenerator>(_ array: [Int], using generator: inout R) -> UInt64 {
    return returnNext(using: &generator)
}

import Dispatch

DispatchQueue.concurrentPerform(iterations: 10_000) { _ in
    var g = MYRNG()
    _ = callReturnNext([0], using: &g)
}

swiftc -sanitize=thread tsan-test.swift -o tsan-test.out && TSAN_OPTIONS="verbosity=3" ./tsan-test.out
==34511==ERROR: ThreadSanitizer: SEGV on unknown address 0x000000000000 (pc 0x0001052d7687 bp 0x7ff7bb11a4f0 sp 0x7ff7bb11a4b0 T6420932)
==34511==The signal is caused by a READ memory access.
==34511==Hint: address points to the zero page.
==34511==Launching Symbolizer process: /usr/bin/atos -p 34511 
ThreadSanitizer:DEADLYSIGNAL
    #0 __tsan::MemoryAccess(__tsan::ThreadState*, unsigned long, unsigned long, int, bool, bool) <null>:2 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x60687)
    #1 __tsan::ExternalAccess(void*, unsigned long, void*, unsigned long) <null>:2 (libclang_rt.tsan_osx_dynamic.dylib:x86_64+0x2adbd)
    #2 callReturnNext<A>(_:using:) <null>:2 (tsan-test.out:x86_64+0x100003ade)
    #3 closure #1 in  <null>:2 (tsan-test.out:x86_64+0x100003bb0)
[...]

TSan Workaround

This release includes a workaround for a bug in TSan.
PR #168
Issue #166 (thanks to @shadowfacts for the report)

TSan's internal bookkeeping seems to be corrupted if you pass an empty struct around as an inout parameter. This pattern is sometimes used by generic algorithms (for example, the standard library's SystemRandomNumberGenerator is an empty struct), and is used internally by WebURL. There is no actual data race, but the corruption of TSan's bookkeeping data can lead to spurious reports of data races or even null-pointer dereferences within the TSan runtime.

To work around this, we add an unused field to these empty structs in debug builds.
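If you hit the same TSan problem in your own code, the idea can be sketched as follows. This is an illustration rather than the exact change made in WebURL: the type name, the `_tsanPadding` field and the use of the `DEBUG` compilation condition (which SwiftPM defines for debug builds, but plain `swiftc` does not unless you pass `-D DEBUG`) are all assumptions.

```swift
/// A stateless generator, similar in shape to the reduced example above.
struct ConstantGenerator: RandomNumberGenerator {
    #if DEBUG
    // Unused field (assumed name): keeps the struct from being zero-sized
    // in debug builds, which is where TSan is typically enabled.
    private var _tsanPadding: UInt8 = 0
    #endif

    init() {}

    mutating func next() -> UInt64 { return 1 }
}

/// Generic code which takes the generator as an inout parameter is unchanged.
func nextValue<R: RandomNumberGenerator>(using generator: inout R) -> UInt64 {
    return generator.next()
}

var generator = ConstantGenerator()
print(nextValue(using: &generator))  // 1
```

Release builds keep the zero-sized layout, so the extra field only costs anything where the sanitizer is likely to be in use.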

Related bug reports: apple/swift#61073, apple/swift#61244, and apple/swift#56405

Improvements to Testing

Additionally, some tests for the Foundation extensions have been refactored, and the "Swifter" HTTP server dependency that was used by some tests has been dropped.
