URL FormatStyle and ParseStrategy

Hi all! The Foundation team is working on bringing FormatStyle and ParseStrategy to URL. We are introducing two types, URL.FormatStyle and URL.ParseStrategy, as a simple, familiar, and error-proof interface for you to configure URL format styles and parse strategies with sensible defaults. We are interested in everyone's feedback on these types. In particular, please let us know:

  • What should URL.FormatStyle's default configuration be? Which components should we display and which ones should we hide?
  • Is ComponentParseStrategy (see details below) necessary? Can it be replaced by functions with parameters? For example, do you prefer strategy.port(.defaultValue(80)) or strategy.port(required: false, defaultValue: 80)?

Thanks, and let me know your thoughts!


URL FormatStyle and ParseStrategy

  • Proposal: FOU-NNNN
  • Author(s): Charles Hu
  • Status: Active review

Introduction

Foundation introduced a new pattern for formatting and parsing currency types such as Date and Measurement last year. This proposal aims to provide similar formatting and parsing capabilities to URL via the introduction of URL.FormatStyle and URL.ParseStrategy. We hope to provide a clear, familiar, and error-proof interface for developers to configure URL format styles and parse strategies with sensible defaults, similar to the rest of the FormatStyle and ParseStrategy family.

Motivation

URL formatting is a frequently requested feature with many high-profile use cases such as the address bar in Safari and link previews in Messages. We want to provide a Foundation API that: 1) covers all essential features that these use cases need; and 2) provides a concise and familiar API surface by conforming to the existing FormatStyle and ParseStrategy protocols.

In addition to parsing URLs, URL.ParseStrategy will also participate in regex pattern matching alongside the other Foundation parse strategies, with the new String Processing library (which will eventually be part of the standard library). You will be able to use URL.ParseStrategy as part of the regex builder to match URLs directly.

Proposed Solution and Example

We propose to introduce two types, URL.FormatStyle and URL.ParseStrategy, that behave exactly like the other format styles and parse strategies we have today. In other words, you'll be able to format a URL like this:

let url = URL("https://charles:password@www.example.com:8080/search?color=red#product")
// The default style:
// - Always displays the scheme, host and path
// - Never displays user, password, query, and fragment
// - Omits port if the scheme is HTTP family ("https" or "http")
let displayString = url.formatted() // https://www.example.com/search

// You can also configure the display strategy for each component:
let custom = url.formatted(
    .url.scheme(.always)
        .host(.omitSpecificSubdomains(["www"]))
        .port(.always)) // https://example.com:8080/search
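For readers who want to experiment before the API exists, the default behaviour described in the comments above can be approximated with today's URLComponents. This is only a rough sketch: `sketchDefaultFormatted` is a hypothetical helper name, not part of the proposal, and it ignores the IDN handling and subdomain options.

```swift
import Foundation

// Hypothetical helper (not proposed API): approximates the default style by
// showing scheme, host, and path, and hiding everything else.
func sketchDefaultFormatted(_ url: URL) -> String? {
    guard let parts = URLComponents(url: url, resolvingAgainstBaseURL: true),
          let scheme = parts.scheme,
          let host = parts.host else { return nil }

    var result = "\(scheme)://\(host)"

    // The port is shown only when the scheme is outside the HTTP family.
    let isHTTPFamily = ["http", "https"].contains(scheme.lowercased())
    if let port = parts.port, !isHTTPFamily {
        result += ":\(port)"
    }

    // User, password, query, and fragment are never shown.
    return result + parts.path
}

let sampleURL = URL(string: "https://charles:password@www.example.com:8080/search?color=red#product")!
print(sketchDefaultFormatted(sampleURL) ?? "unformattable")
// https://www.example.com/search
```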

As a safety feature, URL.FormatStyle will attempt to mitigate IDN homograph attacks. Simply put, URL.FormatStyle will display the Punycode encoded hostname if the hostname contains "lookalike" Unicode characters:

// Not apple.com
print(URL("https://аррІе.com/").formatted())  // https://xn--80ak6aa4i.com
// Not google.com
print(URL("https://gooِgle.com/").formatted()) // https://xn--google-yri.com

URL.ParseStrategy is also component based. You will be able to control which URL component is required to parse the URL, as well as if a default value should be used if a component is missing:

// The default ParseStrategy requires `scheme` and `host` to exist in the URL, and
// it parses URLs leniently.
var strategy = URL.ParseStrategy()
try? strategy.parse("https://www.example.com/path") // returns an URL instance
try? strategy.parse("www.apple.com") // returns `nil` because scheme is missing

// Configure whether each component is required
strategy = URL.ParseStrategy(scheme: .required, host: .required, query: .required)
try? strategy.parse("https://www.cherry.com/path?name=peach") // returns a URL instance
try? strategy.parse("https://www.mango.com/path") // returns `nil` because query is missing

// Configure default values
strategy = URL.ParseStrategy(
    scheme: .defaultValue("https"),
    port: .defaultValue(8080))
// In this case:
// - The scheme is missing so `ParseStrategy` will use the default value "https"
// - The port is not missing, so `ParseStrategy` will keep the original value
try? strategy.parse("www.orange.com:1234") // returns an URL instance: https://www.orange.com:1234
// In this case both fields are missing so `ParseStrategy` will use the default values for both
try? strategy.parse("www.strawberry.com") // returns an URL instance: https://www.strawberry.com:8080

// Error handling
do {
    strategy = URL.ParseStrategy()
    let _ = try strategy.parse("www.blueberry.com")
} catch {
    print(error)
    // Error Domain=NSCocoaErrorDomain Code=2048 "Cannot parse
    // www.blueberry.com. String should adhere to the preferred format
    // of the locale, such as https://www.example.com/path." UserInfo=
    // {NSDebugDescription=Cannot parse www.blueberry.com. String should
    // adhere to the preferred format of the locale, such as
    // https://www.example.com/path.}
}
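To make the default-value behaviour concrete, here is a rough approximation built on URLComponents. `sketchParse` is a hypothetical helper, and the rule it uses to decide that a scheme is missing (no "://" in the text) is an assumption of this sketch, not something the proposal specifies.

```swift
import Foundation

// Hypothetical helper (not proposed API): mimics `.defaultValue` for the
// scheme and the port. Assumption: text without "://" is missing its scheme.
func sketchParse(_ text: String, defaultScheme: String, defaultPort: Int) -> URL? {
    let candidate = text.contains("://") ? text : "\(defaultScheme)://\(text)"

    // The host stays required, as in the default strategy.
    guard var parts = URLComponents(string: candidate),
          let host = parts.host, !host.isEmpty else { return nil }

    // Only fill in the port when the text did not provide one.
    if parts.port == nil {
        parts.port = defaultPort
    }
    return parts.url
}

print(sketchParse("www.orange.com:1234", defaultScheme: "https", defaultPort: 8080) as Any)
print(sketchParse("www.strawberry.com", defaultScheme: "https", defaultPort: 8080) as Any)
```

The "://" check is deliberately naive; a real strategy would need a more careful rule for deciding which components are absent.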

You can also use URL.ParseStrategy within a Regex builder (from the new String Processing library) directly to match URLs:

let text = "https://www.example.com/path 2022-03-14T16:20:32Z"
let regex = Regex {
    capture(.url)
    capture(.iso8601Date)
}
let match = text.match(regex)
// Both the URL and the Date should be automatically captured and parsed.
print(match.match) // ("https://www.example.com/path 2022-03-14T16:20:32Z", URL, Date)

We are also extending the URLComponents' parser to support Internationalized Domain Names (IDNs) as part of this effort. You will now be able to parse, match, and construct URLs with non-ASCII host names like these:

let strategy = URL.ParseStrategy()
// Yes, these are real URLs in use :P
try? strategy.parse("https://👁👄👁.fm") // returns an URL instance

let regex = Regex {
    capture(.url)
}
print("https://🐮.ws".match(regex).match) // ("https://🐮.ws", URL)

let components = URLComponents(string: "http://見.香港/")
print(components.url) // returns an URL instance

Detailed Design

Formatting URLs

We propose to introduce URL.FormatStyle to describe options for formatting an URL instance into user-visible strings:

extension URL {
    /// Strategies for formatting an `URL`.
    @available(TBD)
    public struct FormatStyle : Codable, Hashable, Sendable {
        /// The strategy to display the `scheme` component.
        var scheme: ComponentDisplayStrategy
        /// The strategy to display the `user` component.
        var user: ComponentDisplayStrategy
        /// The strategy to display the `password` component.
        var password: ComponentDisplayStrategy
        /// The strategy to display the `host` component.
        var host: HostDisplayStrategy
        /// The strategy to display the `port` component.
        var port: ComponentDisplayStrategy
        /// The strategy to display the `path` component.
        var path: ComponentDisplayStrategy
        /// The strategy to display the `query` component.
        var query: ComponentDisplayStrategy
        /// The strategy to display the `fragment` component.
        var fragment: ComponentDisplayStrategy

        /// Creates a new `FormatStyle` with the given configurations.
        /// - Parameters:
        ///   - scheme: The strategy to use for formatting the `scheme`.
        ///   - user: The strategy to use for formatting the `user`.
        ///   - password: The strategy to use for formatting the `password`.
        ///   - host: The strategy to use for formatting the `host`.
        ///   - port: The strategy to use for formatting the `port`.
        ///   - path: The strategy to use for formatting the `path`.
        ///   - query: The strategy to use for formatting the `query`.
        ///   - fragment: The strategy to use for formatting the `fragment`.
        public init(
            scheme: ComponentDisplayStrategy = .always,
            user: ComponentDisplayStrategy = .never,
            password: ComponentDisplayStrategy = .never,
            host: HostDisplayStrategy = .always,
            port: ComponentDisplayStrategy = .omitIfHTTPFamily,
            path: ComponentDisplayStrategy = .always,
            query: ComponentDisplayStrategy = .never,
            fragment: ComponentDisplayStrategy = .never)
    }
}

You can use a modifier syntax to customize an URL.FormatStyle:

let style = URL.FormatStyle()
    .scheme(.omitIfHTTPFamily)
    .user(.never)
    .password(.never)
    .host(.omitSpecificSubdomains(["www", "mobile", "m"]))
    .port(.omitIfHTTPFamily)
    .path(.always)
    .query(.never)
    .fragment(.never)
let url = URL("https://charles:pa$$word@www.pear.com:1234/search?color=blue#price")
print(url.formatted(style)) // pear.com/search

These URL component modifiers are defined as follows:

@available(TBD)
extension URL.FormatStyle {
    public func scheme(_ strategy: ComponentDisplayStrategy = .always) -> Self
    public func user(_ strategy: ComponentDisplayStrategy = .never) -> Self
    public func password(_ strategy: ComponentDisplayStrategy = .never) -> Self
    public func host(_ strategy: HostDisplayStrategy = .always) -> Self
    public func port(_ strategy: ComponentDisplayStrategy = .omitIfHTTPFamily) -> Self
    public func path(_ strategy: ComponentDisplayStrategy = .always) -> Self
    public func query(_ strategy: ComponentDisplayStrategy = .never) -> Self
    public func fragment(_ strategy: ComponentDisplayStrategy = .never) -> Self
}
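One plausible way to implement such modifiers on a value type is copy-and-mutate: each modifier copies the style, changes a single field, and returns the copy, leaving the receiver untouched. The sketch below uses hypothetical stand-in types (`SketchDisplay`, `SketchStyle`) with only three components; it is not the proposed implementation.

```swift
// Hypothetical, simplified stand-ins for the proposed types.
enum SketchDisplay: Equatable {
    case always, never, omitIfHTTPFamily
}

struct SketchStyle: Equatable {
    var scheme: SketchDisplay = .always
    var port: SketchDisplay = .omitIfHTTPFamily
    var query: SketchDisplay = .never

    // Each modifier copies `self`, changes one field, and returns the copy.
    func scheme(_ strategy: SketchDisplay) -> Self {
        var copy = self
        copy.scheme = strategy
        return copy
    }

    func port(_ strategy: SketchDisplay) -> Self {
        var copy = self
        copy.port = strategy
        return copy
    }

    func query(_ strategy: SketchDisplay) -> Self {
        var copy = self
        copy.query = strategy
        return copy
    }
}

let baseStyle = SketchStyle()
let customStyle = baseStyle.scheme(.omitIfHTTPFamily).query(.always)
// `baseStyle` is unchanged; `customStyle` differs only in scheme and query.
```

Because every call returns a new value, a chain of modifiers can be stored and reused without affecting the style it was derived from.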

URL.FormatStyle.Component

We want to support conditionally applying a display strategy to each component. For example, you may choose to hide the scheme if it's in the HTTP family (http or https); or you may choose to display the port if it's not one of the known ports. In order to achieve this goal, we need to first define the list of components:

extension URL.FormatStyle {
    @available(TBD)
    public enum Component : Int, Codable, Hashable, Sendable, CustomStringConvertible {
        case scheme
        case user
        case password
        case host
        case port
        case path
        case query
        case fragment

        public var description: String
    }
}

ComponentDisplayStrategy and HostDisplayStrategy will use URL.FormatStyle.Component to create conditional strategies.

ComponentDisplayStrategy and HostDisplayStrategy

The display strategies that URL.FormatStyle directly uses come in two versions:

  • HostDisplayStrategy is a specialized version for the host component. It comes with additional formatting features specifically for hosts.
  • ComponentDisplayStrategy is the generic version used by all other components. It simply represents whether a component should be displayed or omitted given a condition.

extension URL.FormatStyle {
    /// Specifies the display strategy for a component, including whether to display or omit the
    /// component and the condition to do so.
    @available(TBD)
    public struct ComponentDisplayStrategy : Codable, Hashable, CustomStringConvertible, Sendable {
        public var description: String

        /// Creates a display strategy to always display the component.
        public static var always: Self

        /// Creates a display strategy to always omit the component.
        public static var never: Self

        /// Creates a display strategy to display the component when the component meets the requirements
        public static func displayWhen(_ component: URL.FormatStyle.Component, matches requirements: Set<String>) -> Self

        /// Creates a display strategy to omit the component when the component meets the requirements
        public static func omitWhen(_ component: URL.FormatStyle.Component, matches requirements: Set<String>) -> Self

        /// Creates a display strategy to omit the component when the URL's scheme
        /// is `http` or `https`.
        public static var omitIfHTTPFamily: Self
    }
}

Here are some examples:

let style: URL.FormatStyle = .init()
    // Omits the scheme if it's `http` or `https`
    .scheme(.omitIfHTTPFamily)
    // Only displays the user if it's "charles"
    .user(.displayWhen(.user, matches: ["charles"]))
    // Never display the password
    .password(.never)
    // Always display the path
    .path(.always)
    // Omit the port if it's either 8080 or 20
    .port(.omitWhen(.port, matches: ["8080", "20"]))

URL("https://tim:pa$$w0rd@www.lychee.com:42/about").formatted(style)
// www.lychee.com:42/about

URL("ftp://charles:pa$$w0rd@www.coconut.com:8080/files").formatted(style)
// ftp://charles@www.coconut.com/files
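The conditional cases can be read as a small decision function: given the values of the URL's components, should this component be shown? The sketch below models that reading with hypothetical stand-in types (`SketchComponent`, `SketchStrategy`) and a `shouldDisplay(given:)` helper that does not appear in the proposal. Treating a missing component as "does not match" is an assumption.

```swift
// Hypothetical, simplified stand-ins for the proposed types.
enum SketchComponent: Hashable {
    case scheme, user, port
}

enum SketchStrategy {
    case always
    case never
    case displayWhen(SketchComponent, matches: Set<String>)
    case omitWhen(SketchComponent, matches: Set<String>)

    /// Decides visibility from the URL's component values.
    /// Assumption: a component that is absent never matches.
    func shouldDisplay(given values: [SketchComponent: String]) -> Bool {
        switch self {
        case .always:
            return true
        case .never:
            return false
        case let .displayWhen(component, matches):
            return values[component].map { matches.contains($0) } ?? false
        case let .omitWhen(component, matches):
            return !(values[component].map { matches.contains($0) } ?? false)
        }
    }
}

let portStrategy = SketchStrategy.omitWhen(.port, matches: ["8080", "20"])
print(portStrategy.shouldDisplay(given: [.port: "42"]))   // true
print(portStrategy.shouldDisplay(given: [.port: "8080"])) // false
```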

HostDisplayStrategy adds two additional options to manipulate the subdomains of a host:

  • omitMultiLevelSubdomains omits all additional subdomains if there are more than two subdomains in addition to the top-level-domain (TLD). For example: api.code.developer.apple.com is displayed as developer.apple.com (TLD: "com"), whereas api.code.developer.apple.com.cn is displayed as developer.apple.com.cn (TLD: "com.cn")
  • omitSpecificSubdomains omits the first (the leftmost) subdomain if it's in the given set. For example: if the subdomain to omit is "mobile", then mobile.apple.com will be displayed as apple.com while developer.mobile.apple.com will not be changed.

You can also combine these two options to further clean up the host (see examples below).

extension URL.FormatStyle {
    /// Specifies the display strategy for displaying the host component
    @available(TBD)
    public struct HostDisplayStrategy : Codable, Hashable, CustomStringConvertible, Sendable {
        public var description: String

        /// Creates a display strategy to always display the host.
        public static var always: Self

        /// Creates a display strategy to always omit the host.
        public static var never: Self

        /// Creates a display strategy to display the host if the component matches the requirements
        public static func displayWhen(_ component: URL.FormatStyle.Component, matches requirements: Set<String>) -> Self

        /// Creates a display strategy to omit the host if the component matches the requirements
        public static func omitWhen(_ component: URL.FormatStyle.Component, matches requirements: Set<String>) -> Self

        /// Creates a display strategy to omit the host if the URL's scheme
        /// is `http` or `https`.
        public static var omitIfHTTPFamily: Self

        /// Creates a display strategy to manipulate the subdomains of a host
        /// - Parameters:
        ///   - subdomainsToOmit: specifies a set of subdomains to omit
        ///   - omitMultiLevelSubdomains: if `true`, additional subdomains (subdomains more than 2 + TLDs)
        ///     will be omitted.
        public static func omitSpecificSubdomains(
            _ subdomainsToOmit: Set<String> = Set(),
            includeMultiLevelSubdomains omitMultiLevelSubdomains: Bool = false) -> Self

        /// Creates a display strategy to manipulate the subdomains of a host if
        /// the given component matches the requirements
        /// - Parameters:
        ///   - subdomainsToOmit: specifies a set of subdomains to omit
        ///   - omitMultiLevelSubdomains: if `true`, additional subdomains (subdomains more than 2 + TLDs)
        ///     will be omitted.
        ///   - component: the component to check requirements for
        ///   - requirements: the requirements to check
        public static func omitSpecificSubdomains(
            _ subdomainsToOmit: Set<String> = Set(),
            includeMultiLevelSubdomains omitMultiLevelSubdomains: Bool = false,
            when component: URL.FormatStyle.Component,
            matches requirements: Set<String>) -> Self
    }
}

Here are some examples of HostDisplayStrategy that change how subdomains are displayed:

var style: URL.FormatStyle = .init()
    .scheme(.never)
    // Omit the "www" subdomain if it's the first subdomain
    .host(.omitSpecificSubdomains(["www"]))
URL("https://www.banana.com/about").formatted(style)
// banana.com/about

URL("https://developer.www.banana.com/about").formatted(style)
// developer.www.banana.com/about (not changed because www isn't the first subdomain)

style = style
    // Only omit multi-level subdomains
    .host(.omitSpecificSubdomains([], includeMultiLevelSubdomains: true))
URL("https://api.docs.developers.grapefruit.com/about").formatted(style)
// developers.grapefruit.com/about

style = style
    // Omit "www" and "mobile" subdomains as well as multi-level subdomains
    .host(.omitSpecificSubdomains(["mobile", "www"], includeMultiLevelSubdomains: true))
URL("https://api.docs.mobile.pineapple.com/metal").formatted(style)
// pineapple.com/metal

URL("https://mobile.www.m.pineapple.com/metal").formatted(style)
// m.pineapple.com/metal

style = style
    // Omit the "mobile" subdomain and multi-level subdomains IF the URL is in the HTTP family
    .host(.omitSpecificSubdomains(
        ["mobile"],
        includeMultiLevelSubdomains: true,
        when: .scheme, matches: ["http", "https"]))
URL("https://docs.mobile.lemon.com/page").formatted(style)
// lemon.com/page

URL("ftp://docs.mobile.lemon.com/data").formatted(style)
// docs.mobile.lemon.com/data (subdomains are not modified because the condition is not met)
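The "leftmost subdomain only" rule of omitSpecificSubdomains can be sketched with plain string handling. `sketchOmitLeadingSubdomain` is a hypothetical helper, and it deliberately leaves out the multi-level option, which depends on knowing where the TLD ends (for example "com" versus "com.cn"). Keeping at least two labels is an assumption of this sketch.

```swift
// Hypothetical helper (not proposed API): drops the leftmost label of a host
// when it is in the given set. Multi-level omission is not covered here.
func sketchOmitLeadingSubdomain(_ host: String, ifIn subdomains: Set<String>) -> String {
    let labels = host.split(separator: ".")

    // Assumption: keep at least two labels, so "www.com" is left alone.
    guard labels.count > 2,
          let first = labels.first,
          subdomains.contains(String(first)) else {
        return host
    }
    return labels.dropFirst().joined(separator: ".")
}

print(sketchOmitLeadingSubdomain("www.banana.com", ifIn: ["www"]))
// banana.com
print(sketchOmitLeadingSubdomain("developer.www.banana.com", ifIn: ["www"]))
// developer.www.banana.com
```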

Note: omitting multi-level subdomains will not be supported on Linux.

The default style of URL.FormatStyle (i.e. when you create a "blank" FormatStyle without any modifications) always displays the scheme, host, and path, omits the port if the URL is in the HTTP family, and never displays the rest of the components. We believe this is a sensible default for most use cases.

Finally, to align URL.FormatStyle's API surface with other Foundation vended FormatStyles, we propose to introduce these miscellaneous changes:

  • Introduce two formatted methods on URL that format the instance with a given style;
  • Extend URL.FormatStyle to conform to ParseableFormatStyle so you will be able to construct a URL.ParseStrategy from a format style;
  • Introduce a static variable url on FormatStyle and ParseableFormatStyle constrained to Self as shortcuts to initializing the default format style.

extension URL {
    /// Converts `self` to its textual representation.
    /// - Parameter format: The format for formatting `self`.
    /// - Returns: A representation of `self` using the given `format`. The type of
    ///   the representation is specified by `FormatStyle.FormatOutput`
    @available(TBD)
    public func formatted<F: Foundation.FormatStyle>(_ format: F) -> F.FormatOutput where F.FormatInput == URL

    public func formatted() -> String
}

@available(TBD)
extension URL.FormatStyle : ParseableFormatStyle {
    public var parseStrategy: URL.ParseStrategy { get }
}

@available(TBD)
public extension FormatStyle where Self == URL.FormatStyle {
    static var url: Self { get }
}

@available(TBD)
public extension ParseableFormatStyle where Self == URL.FormatStyle {
    static var url: Self { get }
}

Parsing URLs

We propose to introduce URL.ParseStrategy to describe options for parsing a URL string into a URL instance:

extension URL {
    /// Options for parsing string representations of URLs to create an `URL` instance.
    @available(TBD)
    public struct ParseStrategy : Codable, Hashable, Sendable {

        /// The strategy to parse the `scheme` component.
        var scheme: ComponentParseStrategy<String>
        /// The strategy to parse the `user` component.
        var user: ComponentParseStrategy<String>
        /// The strategy to parse the `password` component.
        var password: ComponentParseStrategy<String>
        /// The strategy to parse the `host` component.
        var host: ComponentParseStrategy<String>
        /// The strategy to parse the `port` component.
        var port: ComponentParseStrategy<Int>
        /// The strategy to parse the `path` component.
        var path: ComponentParseStrategy<String>
        /// The strategy to parse the `query` component.
        var query: ComponentParseStrategy<String>
        /// The strategy to parse the `fragment` component.
        var fragment: ComponentParseStrategy<String>

        /// Creates a new `ParseStrategy` with the given configurations.
        /// - Parameters:
        ///   - scheme: The strategy to use for parsing the `scheme`.
        ///   - user: The strategy to use for parsing the `user`.
        ///   - password: The strategy to use for parsing the `password`.
        ///   - host: The strategy to use for parsing the `host`.
        ///   - port: The strategy to use for parsing the `port`.
        ///   - path: The strategy to use for parsing the `path`.
        ///   - query: The strategy to use for parsing the `query`.
        ///   - fragment: The strategy to use for parsing the `fragment`.
        public init(
            scheme: ComponentParseStrategy<String> = .required,
            user: ComponentParseStrategy<String> = .optional,
            password: ComponentParseStrategy<String> = .optional,
            host: ComponentParseStrategy<String> = .required,
            port: ComponentParseStrategy<Int> = .optional,
            path: ComponentParseStrategy<String> = .optional,
            query: ComponentParseStrategy<String> = .optional,
            fragment: ComponentParseStrategy<String> = .optional)
    }
}

@available(TBD)
extension ParseStrategy where Self == URL.ParseStrategy {
    public static var url: Self
}

@available(TBD)
extension URL {
    /// Creates a new `URL` by parsing the given representation.
    /// - Parameters:
    ///   - value: A representation of an URL. The type of the representation is specified
    ///     by `ParseStrategy.ParseInput`.
    ///   - strategy: The parse strategy to parse `value` whose `ParseOutput` is `URL`.
    public init<T: Foundation.ParseStrategy>(_ value: T.ParseInput, strategy: T) throws where T.ParseOutput == Self
}

You can use a modifier syntax to customize an URL.ParseStrategy (similar to URL.FormatStyle):

let strategy = URL.ParseStrategy()
    .scheme(.defaultValue("https"))
    .user(.optional)
    .password(.optional)
    .host(.required)
    .port(.defaultValue(8080))
    .path(.optional)
    .query(.optional)
    .fragment(.optional)
let text = "www.watermelon.com/about"
let url = try? strategy.parse(text) // https://www.watermelon.com:8080/about

These modifiers are defined as follows:

@available(TBD)
extension URL.ParseStrategy {
    public func scheme(_ strategy: ComponentParseStrategy<String> = .required) -> Self
    public func user(_ strategy: ComponentParseStrategy<String> = .optional) -> Self
    public func password(_ strategy: ComponentParseStrategy<String> = .optional) -> Self
    public func host(_ strategy: ComponentParseStrategy<String> = .required) -> Self
    public func port(_ strategy: ComponentParseStrategy<Int> = .optional) -> Self
    public func path(_ strategy: ComponentParseStrategy<String> = .optional) -> Self
    public func query(_ strategy: ComponentParseStrategy<String> = .optional) -> Self
    public func fragment(_ strategy: ComponentParseStrategy<String> = .optional) -> Self
}

ComponentParseStrategy

URL.ParseStrategy uses ComponentParseStrategy (formally URL.ParseStrategy.ComponentParseStrategy) to determine the rules to parse each URL component:

extension URL.ParseStrategy {
    /// Specifies the strategy to use to parse each component.
    @available(TBD)
    public enum ComponentParseStrategy<Component : Codable & Hashable & Sendable> : Codable, Hashable, CustomStringConvertible, Sendable {
        /// Denotes that the component is required to exist in order to consider the URL valid
        case required
        /// Denotes that the component is optional
        case optional
        /// If the component is missing, assume it has the attached default value
        case defaultValue(Component)

        public var description: String
    }
}

In addition to the standard .required and .optional cases, ComponentParseStrategy also provides a third case, .defaultValue(Component), that lets developers specify a default value for each component. This option is especially useful when the data being parsed is known to miss certain fields (most commonly the scheme). Here are some examples:

let strategy: URL.ParseStrategy = .init()
    // When the URL does not have scheme, use "http" as the scheme
    .scheme(.defaultValue("http"))
    // When the URL does not have the port value, use the default `80` port
    .port(.defaultValue(80))

// The returned URL will already have the default values filled in
try? strategy.parse("www.lychee.com") // http://www.lychee.com:80

// `ParseStrategy` will only fill in the missing values. In this case
// it will only fill in the scheme
try? strategy.parse("www.gooseberry.com:8090") // http://www.gooseberry.com:8090
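One way to think about the three cases is as a rule applied to each component after it has been extracted from the text. The sketch below mirrors the proposed enum with a hypothetical `resolve(_:)` helper; the proposal does not say whether the real implementation works this way, so treat it as an illustration of the semantics only.

```swift
// Hypothetical error type for this sketch.
struct SketchMissingComponent: Error {}

// Hypothetical mirror of the proposed enum; `resolve(_:)` is not proposed API.
enum SketchComponentStrategy<Component> {
    case required
    case optional
    case defaultValue(Component)

    /// Applies the rule to a component that may or may not have been found.
    func resolve(_ parsed: Component?) throws -> Component? {
        switch self {
        case .required:
            // A missing required component makes the whole parse fail.
            guard let parsed = parsed else { throw SketchMissingComponent() }
            return parsed
        case .optional:
            return parsed
        case .defaultValue(let fallback):
            // Only fill in the value when the text did not provide one.
            return parsed ?? fallback
        }
    }
}

let portRule = SketchComponentStrategy<Int>.defaultValue(80)
print(try portRule.resolve(nil) as Any)  // Optional(80)
print(try portRule.resolve(8090) as Any) // Optional(8090)
```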

URL.ParseStrategy in String Processing

URL.ParseStrategy will also participate in Regex-powered String Processing by conforming to CustomMatchingRegexComponent:

@available(TBD)
extension URL.ParseStrategy : CustomMatchingRegexComponent {
    typealias Match = URL
}

We will also extend the RegexProtocol with two static members as the dot syntax (.url) shortcuts to URL.ParseStrategy:

extension RegexProtocol where Self == URL.ParseStrategy {
    /// Creates a parse strategy with default configurations
    public static var url: Self

    /// Creates a custom parse strategy with given required components
    public static func url(
        scheme: ComponentParseStrategy<String> = .required,
        user: ComponentParseStrategy<String> = .optional,
        password: ComponentParseStrategy<String> = .optional,
        host: ComponentParseStrategy<String> = .required,
        port: ComponentParseStrategy<Int> = .optional,
        path: ComponentParseStrategy<String> = .optional,
        query: ComponentParseStrategy<String> = .optional,
        fragment: ComponentParseStrategy<String> = .optional) -> Self
}

Please refer back to the Proposed Solution and Example section for some example usages of URL parsing and string matching.

Note: the default configuration for URL.ParseStrategy requires the scheme and host to exist to consider a string a valid URL. It's important to have some requirements when parsing URLs because the URL standard is pretty broad. Many strings that seemingly are "not a URL", such as Foundation.framework, or simply just Foundation, can be considered valid URLs. As a result, component requirements are essential when using URL.ParseStrategy to perform pattern matching -- a "no requirement" strategy will simply match any string up to the next whitespace.
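The effect of requiring a scheme and a host can be checked against today's URLComponents. In the sketch below, `sketchHasSchemeAndHost` is a hypothetical helper that applies the two default requirements as a check after parsing; whether the proposed strategy enforces them during or after parsing is not stated in the proposal.

```swift
import Foundation

// Hypothetical helper (not proposed API): reports whether the text parses
// and carries both a non-empty scheme and a non-empty host.
func sketchHasSchemeAndHost(_ text: String) -> Bool {
    guard let parts = URLComponents(string: text) else { return false }
    let hasScheme = !(parts.scheme ?? "").isEmpty
    let hasHost = !(parts.host ?? "").isEmpty
    return hasScheme && hasHost
}

// Both strings parse as relative references with only a path,
// so they fail the scheme and host requirements.
print(sketchHasSchemeAndHost("Foundation.framework"))         // false
print(sketchHasSchemeAndHost("Foundation"))                   // false
print(sketchHasSchemeAndHost("https://www.example.com/path")) // true
```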

Extending URLComponents to Support Internationalized Domain Names

We posted a proposal on the Swift forums and you told us you'd like to update URL and URLComponents' parser to support Internationalized Domain Names (IDNs) such as http://見.香港/, or http://இலங்கை.icom.museum. We want to take this opportunity to update URLComponents' parser (since it's more modern than URL's parser) to support these domain names with automatic Punycode encoding. As a result, URLComponents.percentEncodedHost no longer makes sense because IDNs must be Punycode encoded instead of percent-encoded. We propose to introduce a new property, URLComponents.encodedHost, for getting and setting the Punycode-encoded host, and to soft-deprecate URLComponents.percentEncodedHost:

public struct URLComponents {
    ...
    @available(macOS, introduced: 10.9, deprecated: 100000.0, message: "Use encodedHost instead")
    @available(iOS, introduced: 7.0, deprecated: 100000.0, message: "Use encodedHost instead")
    @available(tvOS, introduced: 9.0, deprecated: 100000.0, message: "Use encodedHost instead")
    @available(watchOS, introduced: 2.0, deprecated: 100000.0, message: "Use encodedHost instead")
    public var percentEncodedHost: String?

    @available(TBD)
    public var encodedHost: String?
}

Here are some examples of Punycode encoded hosts:

var urlComponents = URLComponents(string: "http://見.香港")!
print(urlComponents.host) // 見.香港
print(urlComponents.encodedHost) // xn--nw2a.xn--j6w193g

// Setting raw host
urlComponents = URLComponents()
urlComponents.scheme = "https"
urlComponents.host = "👁👄👁.fm"

print(urlComponents.encodedHost) // xn--mp8hai.fm
print(urlComponents.string) // https://xn--mp8hai.fm

// Setting encoded host
urlComponents = URLComponents()
urlComponents.scheme = "https"
urlComponents.encodedHost = "xn--2o8h.ws"

print(urlComponents.host) // 🐮.ws
print(urlComponents.string) // https://xn--2o8h.ws
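For readers unfamiliar with Punycode, the encoding step can be sketched directly from the algorithm in RFC 3492. The helpers `sketchPunycode` and `sketchEncodedHost` below are hypothetical names for illustration; they skip the normalization, case mapping, and validation that full IDNA processing requires, and they are not how the proposal says URLComponents will implement encodedHost.

```swift
// Hypothetical helper: Punycode-encodes a single label (RFC 3492 encoder).
func sketchPunycode(_ label: String) -> String {
    let base = 36, tMin = 1, tMax = 26, skew = 38, damp = 700
    let scalars = label.unicodeScalars.map { Int($0.value) }

    // Maps 0...25 to "a"..."z" and 26...35 to "0"..."9".
    func digit(_ d: Int) -> Character {
        Character(UnicodeScalar(UInt8(d < 26 ? d + 97 : d - 26 + 48)))
    }

    // Bias adaptation from RFC 3492, section 6.1.
    func adapt(_ delta: Int, _ numPoints: Int, _ firstTime: Bool) -> Int {
        var delta = firstTime ? delta / damp : delta / 2
        delta += delta / numPoints
        var k = 0
        while delta > ((base - tMin) * tMax) / 2 {
            delta /= base - tMin
            k += base
        }
        return k + ((base - tMin + 1) * delta) / (delta + skew)
    }

    // Copy the ASCII code points first, followed by the "-" delimiter.
    let basic: [Unicode.Scalar] = label.unicodeScalars.filter { $0.isASCII }
    var output = ""
    output.unicodeScalars.append(contentsOf: basic)
    if !basic.isEmpty { output.append("-") }

    var n = 128, delta = 0, bias = 72
    var handled = basic.count
    while handled < scalars.count {
        // Next smallest code point that has not been encoded yet.
        guard let m = scalars.filter({ $0 >= n }).min() else { break }
        delta += (m - n) * (handled + 1)
        n = m
        for c in scalars {
            if c < n { delta += 1 }
            if c == n {
                // Emit `delta` as a variable-length integer.
                var q = delta
                var k = base
                while true {
                    let t = k <= bias ? tMin : (k >= bias + tMax ? tMax : k - bias)
                    if q < t { break }
                    output.append(digit(t + (q - t) % (base - t)))
                    q = (q - t) / (base - t)
                    k += base
                }
                output.append(digit(q))
                bias = adapt(delta, handled + 1, handled == basic.count)
                delta = 0
                handled += 1
            }
        }
        delta += 1
        n += 1
    }
    return output
}

// Hypothetical helper: encodes each non-ASCII label and adds the "xn--" prefix.
func sketchEncodedHost(_ host: String) -> String {
    host.split(separator: ".", omittingEmptySubsequences: false).map { label -> String in
        let isASCII = label.unicodeScalars.allSatisfy { $0.isASCII }
        return isASCII ? String(label) : "xn--" + sketchPunycode(String(label))
    }.joined(separator: ".")
}

print(sketchEncodedHost("見.香港")) // xn--nw2a.xn--j6w193g
```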

Note: we decided not to update URL's parser, to preserve backward compatibility. This means you can only construct a URL with IDNs via URLComponents. Constructing internationalized URLs via URL's constructor directly (URL("https://👁👄👁.fm")) will still return nil.

Impact on Existing Code

Minimal. This proposal mostly contains additive API surface changes, except for URLComponents' IDN adoption.

Alternatives Considered

None.

Would this proposal apply to swift-corelibs-foundation as well?

I've wanted to use the new formatting APIs introduced last year on Linux, but it seems swift-corelibs-foundation doesn't expose any of the new FormatStyle-based formatting APIs.

Is this something in the works, or perhaps a technical limitation?

It's simply a matter of doing the work. However, since the behavior of swift-corelibs-foundation needs to match Apple's Foundation, and only Apple has access to the actual code there, it's usually something Apple needs to do, and they don't. So... :man_shrugging:

Yes, this feature is planned to be part of swift-corelibs-foundation with one caveat: the "omit multi-level subdomains" feature from URL.FormatStyle won't be available on Linux.

I'm sceptical that the formatting capabilities will really do what users want in a robust, secure way. It seems too low-level, and delegates the difficult decisions to developers who are not necessarily experts or informed to make those decisions.

The proposed parsing behaviour is pretty weird and will cause Foundation's behaviour to deviate further from the rest of the world. I'm not really sure what else to say about it; it's hard to say what it's for. If it is intended to be used with UI text entry, it's nowhere near what browsers do to fix up dodgy human-entered URLs, and probably shouldn't be as configurable as it is.

It's a worthwhile thing to tackle, but I'm not sure this is it. Also, throwing in IDNA as a bonus part of an otherwise unrelated proposal is a bit jarring.


Err, I don't think it's a good idea for url.formatted() to hide all of this stuff. For one thing, the relevant information in a URL is scheme-specific - it is difficult to make sweeping judgements on the URL level because you don't know how it will ultimately be interpreted to identify something. Keep in mind that the "U" in "URL" stands for "Universal"; people can do whatever they want with it.

  • For non-http URL schemes (ftp://, smb://, telnet://, ssh://, svn://, git://, mongodb://, etc.), the username and password may indeed be important.

  • For http/s, the username and password are often not even sent by the client (e.g. browsers strip them), so hiding them may be reasonable if URLSession does the same. I don't know if it does.

  • The query and fragment can also be important, even for http/s. It would be strange to have 100 URLs which are all different, but when formatted they all produce exactly the same string: "https://google.com/search".

    It could also be unwise to format URLs such as "https://shop.com?action=makePayment" or "?action=resetPassword" to only display as "https://shop.com".

  • The port is worth keeping, even for http/s URLs. I think the criterion you're looking for is "default port" (e.g. a URL which explicitly says HTTP on port 80, HTTPS on port 443, or FTP on port 21, etc). Non-default ports should always be printed, IMO (e.g. "http://localhost:8080/foo").

    I can't think of a case where you're making an HTTP request to a non-default port and shouldn't display the port number, or why it would be different for http/s as opposed to other schemes.

  • Does required actually affect parsing, or is it just a post-condition? i.e. "parse this thing normally, but if the result doesn't satisfy this predicate, discard it and consider it no match"?

    If so, it seems like a generic thing that should be made available to all regex components. What if I have more detailed requirements? For example, perhaps I need the scheme to not only exist, but have a specific value (e.g. I only want to parse redis:// URLs), or I only want to match if the hostname has a specific eTLD.

  • Can I really have any combination of default values? Can I set a default scheme, host, port and path, and parse a lone query string? Or can I just parse a lone number and it considers that to be a port string?

    What is quite interesting is that "www.orange.com:1234" is actually a perfectly valid URL. Browsers and other URL libraries will interpret that string as a URL with opaque path:

                            VVVV - path
             www.orange.com:1234
    scheme - ^^^^^^^^^^^^^^
    
    This is similar to other URLs with opaque paths, for example:
    
                        VVVVVVVVVVVVVV - path
             javascript:alert("hello")
    scheme - ^^^^^^^^^^
    

    So it is surprising that Foundation thinks this string is actually lacking a scheme and that it will insert one.

    I can see this being an enormous headache if anybody builds their application to rely on it. It is very difficult to interpret URLs with missing pieces the way this feature is trying to, and that interpretation won't be portable or interoperable with any other systems (which will interpret the same strings as meaning very, very different things).

    This sort of parsing should also be performed by users only with the utmost care. It is very easy to write security vulnerabilities by overusing this kind of thing.
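To make the "www.orange.com:1234" point concrete, here is a minimal sketch of the generic scheme split described in RFC 3986, section 3.1 (scheme = ALPHA followed by letters, digits, "+", "-" or "."). The helper name splitScheme is hypothetical and is not part of Foundation or of the proposal; it only illustrates why a generic parser sees "www.orange.com" as a scheme rather than as a host.

```swift
// Hypothetical helper, not a Foundation API: splits "scheme:rest" using the
// scheme grammar from RFC 3986, section 3.1.
func splitScheme(_ input: String) -> (scheme: String, rest: String)? {
    // No colon means there is no scheme at all.
    guard let colon = input.firstIndex(of: ":") else { return nil }
    let candidate = input[..<colon]

    // The first character must be an ASCII letter.
    guard let first = candidate.first, first.isASCII, first.isLetter else { return nil }

    // The remaining characters may be ASCII letters, digits, "+", "-" or ".".
    let isValid = candidate.allSatisfy { ch in
        ch.isASCII && (ch.isLetter || ch.isNumber || ch == "+" || ch == "-" || ch == ".")
    }
    guard isValid else { return nil }

    // Everything after the first colon is left uninterpreted here.
    return (String(candidate), String(input[input.index(after: colon)...]))
}

let orange = splitScheme("www.orange.com:1234")
let script = splitScheme("javascript:alert(\"hello\")")
let digits = splitScheme("1234:foo")
```

Under these rules orange and script both come back with a scheme ("www.orange.com" and "javascript"), while digits is nil because a scheme cannot start with a digit. A strategy that instead inserts a default scheme in front of "www.orange.com:1234" is interpreting the string differently from this grammar.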

If it only works with URLComponents, doesn't that mean that it won't work with any of this URL.ParseStrategy stuff?

There seems to be a lack of a clear vision. Is the Foundation team recommending we use URLComponents for IDNA and because it has a more modern (albeit still not web-compatible) parser? Or are they recommending this new parsing strategy stuff (which has no IDNA support)?

Also, there needs to be more information on IDN support. There is a reference to punycode specifically, but IDNA also involves Unicode normalisation (of course, since you're encoding Unicode data as an ASCII string), case-folding, versions and compatibility modes, etc. What exactly is being proposed?

// What is the result of this?
URLComponents(string: "http://www。apple。com/") == URLComponents(string: "http://www.apple.com/")

Are there any APIs for encoding/decoding lone hostnames apart from URLComponents?

The naming here also seems confused. Why does a URL have a format style, but its components only have display strategies?

It seems the word "strategy" is a relatively new-ish trend for Foundation APIs. IMO it makes the framework feel inconsistent - previous APIs used the word "options" (e.g. Data.ReadingOptions, Data.WritingOptions, Data.Base64DecodingOptions, Data.SearchOptions, String.CompareOptions, String.EncodingConversionOptions, etc).

The new AttributedString API also has AttributedString.MarkdownParsingOptions, AttributedString.FormattingOptions, and AttributedString.InterpolationOptions. But then it also uses -Configuration for Codable-related stuff, so... 🤷‍♂️

It would be nice to have greater consistency.

Yeah... I just see this and think of how easy it will be for the average developer to accidentally mislead their users, or how many phishing attacks will be developed to exploit gaps in this logic.

Formatting URLs so that even non-expert users can make informed security/privacy decisions is known to be a Very Hard Problem™. Browser developers spend a lot of time trying to figure that out, and even they are far from perfect. A library solution is warranted, for sure, but I'm concerned about exposing this level of granularity. I suspect that toggling individual URL components is too low-level for most developers.

I hope this is doing actual public suffix list/same-site matching, not just "omits all additional subdomains if there are more than two subdomains in addition to the top-level-domain (TLD)".

Otherwise, you know... "karl.developer.apple.com" would also display as "developer.apple.com". So would "alice.developer.apple.com", "bob.developer.apple.com", "malicious-user.developer.apple.com", etc.
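A rough sketch of the difference, assuming a tiny made-up suffix set in place of the real public suffix list (the entry "example-hosting.co.uk" and both function names are hypothetical, and real PSL wildcard and exception rules are ignored):

```swift
// Hypothetical stand-in for the public suffix list; the real list is far
// larger and also has wildcard and exception rules, which are ignored here.
let sampleSuffixes: Set<String> = ["com", "co.uk", "example-hosting.co.uk"]

/// Count-based rule: keep at most the last three labels of the host.
func naiveShortHost(_ host: String) -> String {
    let labels = host.split(separator: ".")
    return labels.suffix(3).joined(separator: ".")
}

/// Suffix-aware rule: the longest matching suffix plus one more label.
func registrableDomain(_ host: String, suffixes: Set<String>) -> String? {
    let labels = host.lowercased().split(separator: ".").map(String.init)
    // Candidates are tried from longest to shortest, so the first hit is the
    // longest matching suffix.
    for start in 0..<labels.count {
        let candidate = labels[start...].joined(separator: ".")
        if suffixes.contains(candidate) {
            // A host that is itself a public suffix has no registrable domain.
            guard start > 0 else { return nil }
            return labels[(start - 1)...].joined(separator: ".")
        }
    }
    return nil
}

let naiveAlice = naiveShortHost("alice.example-hosting.co.uk")
let naiveBob = naiveShortHost("bob.example-hosting.co.uk")
let pslAlice = registrableDomain("alice.example-hosting.co.uk", suffixes: sampleSuffixes)
let pslBob = registrableDomain("bob.example-hosting.co.uk", suffixes: sampleSuffixes)
```

With this sample data the count-based rule collapses both hosts to "example-hosting.co.uk", whereas the suffix-aware rule keeps "alice.example-hosting.co.uk" and "bob.example-hosting.co.uk" distinct.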

Presumably that means yes (i.e. it is doing real PSL matching, like a browser would)?

why is this API not available on linux?

Giving the developer full control over how to format an input (in this case a URL instance) is the design principle behind the various FormatStyles that Foundation provides. Our goal is to make the FormatStyle APIs as flexible as possible so they can be applied to as many use cases as possible. It's certainly difficult, if not impossible, to provide a higher-level API with fewer knobs that fits all use cases. For example, Safari might decide to hide everything except the host while an FTP client might decide to show the username and path.

It's very difficult to provide a "default" format style for URL because, as you suggested, there simply isn't a universally accepted one. I considered making the default style not omit anything, but this approach would render formatted() a no-op -- it would be as if the user is calling url.absoluteString. Please let me know if you have a default format style in mind!

I wholeheartedly agree. This is why we support conditions such as "hide the username only if the scheme is http or https": formatStyle.user(.omitWhen(.scheme, matches: ["http", "https"])). There isn't a universal way to format a URL, which is why we encourage developers to configure the format style to fit their specific needs.
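For comparison, a scheme-conditional rule of this kind can be approximated today with URLComponents. The sketch below is only an illustration, not the proposed API: displayString(for:) and the defaultPorts table are hypothetical names, and the port handling follows the "hide only default ports" criterion suggested earlier in the thread.

```swift
import Foundation

// Hypothetical table of well-known default ports, for illustration only.
let defaultPorts = ["http": 80, "https": 443, "ftp": 21]

/// Hypothetical helper: hides the user and password for http/https only,
/// and hides the port only when it is the default for the scheme.
func displayString(for url: URL) -> String {
    guard var components = URLComponents(url: url, resolvingAgainstBaseURL: true) else {
        return url.absoluteString
    }
    let scheme = components.scheme?.lowercased() ?? ""

    // Credentials are dropped only for http and https.
    if scheme == "http" || scheme == "https" {
        components.user = nil
        components.password = nil
    }

    // Non-default ports are always kept.
    if let port = components.port, defaultPorts[scheme] == port {
        components.port = nil
    }
    return components.string ?? url.absoluteString
}

let web = displayString(for: URL(string: "https://alice:secret@example.com:443/a")!)
let local = displayString(for: URL(string: "http://localhost:8080/foo")!)
let ftp = displayString(for: URL(string: "ftp://alice@files.example.com/pub")!)
```

Here web loses its credentials and its default port, while local keeps ":8080" and ftp keeps its username, since neither rule applies to them.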

I like this idea. Thank you!

The intended use case for .defaultValue is to allow developers to specify a value to be included in the constructed URL instance if it's known that the string being parsed is missing some information needed. This way the developers wouldn't have to manually modify these URL instances to add the missing info afterward. These default values are not placeholders -- you won't be able to "place hold" other components and then just parse a number as the port.

Yes, you'll be able to use URL.ParseStrategy with IDNs:

let strategy = URL.ParseStrategy()
// Yes, these are real URLs in use :P
try? strategy.parse("https://👁👄👁.fm") // returns a URL instance

Thanks for the suggestion. I agree that having displayStrategy under FormatStyle is confusing. I'll change ComponentDisplayStrategy and HostDisplayStrategy to ComponentDisplayOption and HostDisplayOption.

Could you elaborate on this? Without being able to configure each component, what are some other ways for developers to influence the formatting output?

Yes, this feature is performing public suffix list matching to determine the TLDs.

This feature is backed by a large data file (the public suffix list, used to look up known top-level domains) and we aren't yet sure of the best way to deliver it on Linux without a large increase in the size of the library. It's not a final decision and we can revisit it later.

that’s unfortunate. i remember being told when Foundation was first open-sourced that the gap between the macOS and linux APIs would gradually close over time. instead it seems like it’s getting even worse…

It would be a good idea for a standalone package, IMO (EDIT: There are some). You really want to ensure that everybody who might need this data is seeing the same version of it, so they should all use the same library if possible. It's a really useful thing for both clients and servers though.

The list isn't that big - 12K lines (GitHub's "SLOC" metric), and each line is usually a really tiny string. There are lots of repeated patterns as well, so you could probably generate some kind of efficient table from it.

If the host display strategy/options allowed for an arbitrary predicate closure, a package could probably vend a Foundation integration library (cross-import overlay...?) which polyfilled this API on platforms without a system-wide source, but in a way that is opt-in and less monolithic - that's what everybody always asks for, right?
