Pitch(Foundation): String Encoding Names

Updated the pitch to go forward:

I'd be glad if @itingliu could take a look at this pitch. I would appreciate your help.


String Encoding Names

  • Proposal: Not assigned yet
  • Author(s): YOCKOW
  • Review Manager: TBD
  • Status: Pitch

Introduction

This proposal allows String.Encoding to be converted to and from various names.

For example:

// Based on IANA registry
print(String.Encoding.utf8.charsetName!) // Prints "UTF-8"
print(String.Encoding(charsetName: "ISO_646.irv:1991") == .ascii) // Prints "true"

// Based on WHATWG Living Standard
print(String.Encoding.macOSRoman.standardName!) // Prints "macintosh"
print(String.Encoding(standardName: "us-ascii") == .windowsCP1252) // Prints "true"

Motivation

String encoding names are widely used in computer networking and other areas. For instance, you often see them in HTTP headers such as Content-Type: text/plain; charset=UTF-8 or in XML documents with declarations such as <?xml version="1.0" encoding="Shift_JIS"?>.

Therefore, it is necessary to parse and generate such names.

Current solution

Swift lacks the necessary APIs, requiring the use of CoreFoundation (hereinafter called "CF") as described below.

import Foundation

extension String.Encoding {
  var nameInLegacyWay: String? {
    // 1. Convert `String.Encoding` value to the `CFStringEncoding` value.
    //    NOTE: The raw value of `String.Encoding` is the same as the value of `NSStringEncoding`,
    //          but it is not equal to the value of `CFStringEncoding`.
    let cfStrEncValue: CFStringEncoding = CFStringConvertNSStringEncodingToEncoding(self.rawValue)

    // 2. Convert it to the name, whose type is `CFString?`.
    let cfStrEncName: CFString? = CFStringConvertEncodingToIANACharSetName(cfStrEncValue)

    // 3. Convert `CFString` to Swift's `String`.
    //    NOTE: Unfortunately, they cannot be implicitly cast on Linux.
    let charsetName: String? = cfStrEncName.flatMap {
      let bufferSize = CFStringGetMaximumSizeForEncoding(
        CFStringGetLength($0),
        kCFStringEncodingASCII
      ) + 1
      let buffer = UnsafeMutablePointer<CChar>.allocate(capacity: bufferSize)
      defer {
        buffer.deallocate()
      }
      guard CFStringGetCString($0, buffer, bufferSize, kCFStringEncodingASCII) else {
        return nil
      }
      return String(utf8String: buffer)
    }
    return charsetName
  }

  init?(fromNameInLegacyWay charsetName: String) {
    // 1. Convert `String` to `CFString`
    let cfStrEncName: CFString = charsetName.withCString { (cString: UnsafePointer<CChar>) -> CFString in
      return CFStringCreateWithCString(nil, cString, kCFStringEncodingASCII)
    }

    // 2. Convert it to `CFStringEncoding`
    let cfStrEncValue: CFStringEncoding = CFStringConvertIANACharSetNameToEncoding(cfStrEncName)

    // 3. Check whether or not it's valid
    guard cfStrEncValue != kCFStringEncodingInvalidId else {
      return nil
    }

    // 4. Convert `CFStringEncoding` value to `String.Encoding` value
    self.init(rawValue: CFStringConvertEncodingToNSStringEncoding(cfStrEncValue))
  }
}

What's the problem with the current solution?

  • Chaining multiple CF functions to get a simple value is complicated. That's not Swifty.
  • CF functions are legacy APIs that do not always fit modern requirements.
  • CF APIs are officially unavailable from Swift on non-Darwin platforms.

Proposed solution

The solution is straightforward.
We introduce computed properties that return the name, and initializers that create an instance from the name as shown below.

extension String.Encoding {
  /// Returns the name of the IANA registry "charset" that is the closest mapping to this string
  /// encoding.
  public var charsetName: String? { get }

  /// Creates an instance from the name of the IANA registry "charset".
  public init?(charsetName: String)

  /// Returns the name of the WHATWG encoding that is the closest mapping to this string encoding.
  public var standardName: String? { get }

  /// Creates an instance from the name of the WHATWG encoding.
  public init?(standardName: String)
}

Detailed design

This proposal refers to "Character Sets" published by IANA and to "The Encoding Standard" published by WHATWG. While the latter may claim that it could replace the former, it focuses entirely on Web browsers (and their JavaScript APIs).

As shown in the String.Encoding-Name conversion graph below, the two are incompatible, which makes a compromise difficult. You may want to ask which is better, but the choice depends on your specific needs[1]. Since Swift APIs should be more universal, we consult both here.

[Figure: The graph of String.Encoding-Name conversions][2]

String.Encoding to Name

  • Unlike CF, the returned names may contain upper-case letters.
    • charsetName returns the Preferred MIME Name or the Name of the encoding defined in "IANA Character Sets".
    • standardName returns the Name of the encoding defined by "The Encoding Standard".
  • Both String.Encoding.shiftJIS.charsetName and String.Encoding.shiftJIS.standardName return "Shift_JIS", since "CP932" is no longer available as the name of any encoding.

Name to String.Encoding

  • init(charsetName:) adopts "Charset Alias Matching" defined in UTS #22.
    • e.g., "u.t.f-008" is recognized as "UTF-8".
  • init(charsetName:) handles ISO-8859-* names consistently.
    • For example, CF handles "ISO-8859-1-Windows-3.1-Latin-1" and "csWindows31Latin1" inconsistently.
  • init(standardName:) adopts case-insensitive comparison described in §4.2. Names and labels of The Encoding Standard.
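
As a rough illustration of the two matching rules listed above, here is a minimal sketch in plain Swift. It is not the pitch's implementation: `normalizedCharsetAlias`, `whatwgName(forLabel:)`, and `whatwgLabelExcerpt` are hypothetical names, the three normalization steps follow my reading of "Charset Alias Matching" in UTS #22, and the label table is only a small excerpt of the one in the Encoding Standard.

```swift
/// Hypothetical helper: normalizes a name as I understand "Charset Alias Matching" (UTS #22):
/// 1. drop everything except ASCII letters and digits, 2. lowercase the letters,
/// 3. drop each "0" that is not preceded by a (kept) digit.
func normalizedCharsetAlias(_ name: String) -> String {
  var result = String.UnicodeScalarView()
  var lastKeptIsDigit = false
  for scalar in name.unicodeScalars {
    switch scalar {
    case "a"..."z":
      result.append(scalar)
      lastKeptIsDigit = false
    case "A"..."Z":
      result.append(Unicode.Scalar(UInt8(scalar.value) + 0x20))
      lastKeptIsDigit = false
    case "1"..."9":
      result.append(scalar)
      lastKeptIsDigit = true
    case "0":
      if lastKeptIsDigit { result.append(scalar) }
    default:
      break // punctuation, whitespace, and non-ASCII scalars are ignored
    }
  }
  return String(result)
}

/// Small excerpt of WHATWG labels (keys) and encoding names (values); the real table is much longer.
let whatwgLabelExcerpt: [String: String] = [
  "utf-8": "UTF-8", "utf8": "UTF-8",
  "windows-1252": "windows-1252", "iso-8859-1": "windows-1252",
  "latin1": "windows-1252", "us-ascii": "windows-1252",
  "macintosh": "macintosh", "x-mac-roman": "macintosh",
]

/// Hypothetical helper: trims ASCII whitespace, lowercases ASCII letters, then looks up the label.
func whatwgName(forLabel label: String) -> String? {
  let asciiWhitespace: Set<Unicode.Scalar> = ["\t", "\n", "\u{0C}", "\r", " "]
  let upper: ClosedRange<Unicode.Scalar> = "A"..."Z"
  var scalars = Substring(label).unicodeScalars
  while let first = scalars.first, asciiWhitespace.contains(first) { scalars.removeFirst() }
  while let last = scalars.last, asciiWhitespace.contains(last) { scalars.removeLast() }
  var key = String.UnicodeScalarView()
  for scalar in scalars {
    key.append(upper.contains(scalar) ? Unicode.Scalar(UInt8(scalar.value) + 0x20) : scalar)
  }
  return whatwgLabelExcerpt[String(key)]
}
```

With these helpers, "u.t.f-008" and "UTF-8" normalize to the same key under the UTS #22 rule, while the WHATWG-style lookup rejects "u.t.f-008" because it only trims whitespace and ignores ASCII case.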

Source compatibility

The changes proposed here are purely additive. However, care must be taken when migrating from CF APIs.

Implications on adoption

This feature can be freely adopted and un-adopted in source code with no deployment constraints and without affecting source compatibility.

Future directions

String.init(data:encoding:) and String.data(using:) will be implemented more appropriately[3].

Alternatives considered

Expose APIs only for the IANA Character Sets

Modern Web browsers have unfortunately deviated from the IANA's charset list. That means that it is better to adhere to the WHATWG Encoding Standard if you mainly handle web content. We often require "The Living Standard" to cover such use cases.

Expose APIs only for the WHATWG Encoding Standard

As mentioned above, the WHATWG Encoding Standard focuses on the latest Web browsers. This can cause issues in some cases.

Imagine handling an XML 1.1 file declaring that its encoding is "ISO-8859-1": <?xml version="1.1" encoding="ISO-8859-1"?>. What if that file contains a byte 0x85? In ISO-8859-1, 0x85 decodes to U+0085 (NEL), which is a valid end-of-line character in XML 1.1[4].

On the other hand, the WHATWG Encoding Standard requires that the "ISO-8859-1" label be resolved to "windows-1252". In windows-1252, the byte 0x85 decodes to U+2026 (Horizontal Ellipsis), which may cause a fatal error when parsing the XML file.

In such cases, consulting the IANA registry is necessary.
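
To make the 0x85 example concrete, here is a small sketch that avoids Foundation entirely. Both decoders are hypothetical helpers written only for this illustration. The windows-1252 one knows a single entry of the 0x80...0x9F block (0x85) and gives up on the rest, so it is an excerpt and not a full decoder.

```swift
/// ISO-8859-1 maps every byte to the code point with the same value.
func decodeISO8859_1(_ bytes: [UInt8]) -> String {
  var scalars = String.UnicodeScalarView()
  for byte in bytes { scalars.append(Unicode.Scalar(byte)) }
  return String(scalars)
}

/// Excerpt of windows-1252: identical to ISO-8859-1 outside 0x80...0x9F.
/// Only 0x85 is handled inside that block; other bytes there return nil (table omitted).
func decodeWindows1252Excerpt(_ bytes: [UInt8]) -> String? {
  let horizontalEllipsis: Unicode.Scalar = "\u{2026}"
  var scalars = String.UnicodeScalarView()
  for byte in bytes {
    switch byte {
    case 0x85: scalars.append(horizontalEllipsis)
    case 0x80...0x9F: return nil
    default: scalars.append(Unicode.Scalar(byte))
    }
  }
  return String(scalars)
}

let nextLine: Unicode.Scalar = "\u{85}"
let sampleBytes: [UInt8] = [0x61, 0x85, 0x62] // "a", 0x85, "b"

let asLatin1 = decodeISO8859_1(sampleBytes)
let asWindows1252 = decodeWindows1252Excerpt(sampleBytes)

// Under ISO-8859-1 the text contains NEL, an XML 1.1 end-of-line character.
print(asLatin1.unicodeScalars.contains(nextLine))
// Under windows-1252 the same byte becomes an ellipsis, so the line break is lost.
print(asWindows1252?.unicodeScalars.contains(nextLine) ?? false)
```

The same three bytes therefore produce a line break under one name resolution and ordinary text under the other, which is the mismatch the XML example describes.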

Consolidate them

We might be able to consolidate them into a single kind of API like this:

extension String.Encoding {
  public var name: String? { get }
  public init?(name: String)
}

However, this approach would be too arbitrary, and consistent behavior would be too difficult to maintain.

Follow Locale as precedent

Locale has an enum named IdentifierType to specify which kind of identifier should be used.
We could apply the same approach to String.Encoding:

extension String.Encoding {
  public enum NameType {
    case iana
    case whatwg
  }
  public func name(_ type: NameType) -> String?
  public init?(name: String, type: NameType)
}

Acknowledgments

Thanks to everyone who gave me advice on the pitch thread, especially to @benrimmington and @xwu, who channeled their concerns into this proposal at a very early stage.


  1. You may just want to parse an old XML document locally. ↩︎

  2. Foundation assumes UTF-16 without BOM is big endian when decoding. ↩︎

  3. ☂️ `String.init(data:encoding:)`/`String.data(using:)` regression. · Issue #1015 · swiftlang/swift-foundation · GitHub ↩︎

  4. Extensible Markup Language (XML) 1.1 (Second Edition) ↩︎

I did not follow every discussion about the intricacies of mapping charsets in the previous versions of the pitch, but I think the proposed API addition in the latest version is useful and relevant.

Nitpick: You can still use CF API through swift-corelibs-foundation on non-Darwin platforms. But yes, that is not Swift, and you're right that it's useful to have Swift parity.

Your Alternatives Considered section includes the option of following Locale with an enum NameType, which @jmschonfeld also mentioned before. What's the reason that you're not pursuing this idea?

That said, I do like having all the initializers laid out and reserving the enum NameType for getters only, i.e. something like this. What do you think?

extension String.Encoding {
    public init?(charsetName: String)
    public init?(standardName: String)
    public enum NameType {
        case iana
        case whatwg
    }
    public func name(_ type: NameType) -> String?
}

While we're at it, are charsetName and standardName the standard way to refer to the IANA and WHATWG standards? Would it be more straightforward to just name the arguments as

public init?(ianaName: String)
public init?(whatwgName: String)

Thank you so much for your reply!


To be honest, there was no significant rationale.
I thought the discussion would go better with a concrete (sample) implementation, so I chose the one described in the previous pitch.

It hadn't occurred to me that getters and initializers could be asymmetric, but I now think that is the better, perhaps the best, way to realize this feature. Thank you for your input.

That's certainly easy to understand.
(One possible alternative could be ianaName↔︎ianaCharsetName, whatwgName↔︎whatwgEncodingName...redundant?)


I'm sorry; I phrased it that way because of the context of the other thread. I should have added a footnote.


I would suggest even shorter labels:

public init?(iana: String)
public init?(whatwg: String)

There’s no reason here to repeat “encoding” (or “charset” in IANA parlance, which at worst could be confused with CharacterSet) given the type being initialized, and it’s not plausible for the sole string argument to be anything but the name of the encoding.


Thanks to @itingliu and @xwu, I've updated the pitch:

What's changed?

  • Introduce enum NameType.
    • Getter is now a function that takes an instance of NameType
  • All initializers with short labels are laid out for each type.
  • Some notes are added to "Detailed design".
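
For readers skimming the thread, the shape described by this list might look roughly like the sketch below. It uses a stand-in enum, `DemoEncoding`, which is hypothetical, so that it compiles without Foundation. The real pitch extends String.Encoding, and its mapping tables are far larger and use the matching rules from "Detailed design" instead of the simple `lowercased()` lookup assumed here.

```swift
/// Hypothetical stand-in for `String.Encoding`, limited to four encodings.
enum DemoEncoding: Equatable {
  case utf8, ascii, windowsCP1252, macOSRoman

  enum NameType {
    case iana
    case whatwg
  }

  /// Getter as a function that takes a `NameType`.
  func name(_ type: NameType) -> String? {
    switch (self, type) {
    case (.utf8, _): return "UTF-8"
    case (.ascii, .iana): return "US-ASCII"
    case (.ascii, .whatwg): return nil // assumption: this toy table has no WHATWG name for ASCII
    case (.windowsCP1252, _): return "windows-1252"
    case (.macOSRoman, _): return "macintosh"
    }
  }

  /// One initializer per name type, with short labels.
  init?(iana name: String) {
    switch name.lowercased() { // simplified matching; only a few aliases are listed
    case "utf-8": self = .utf8
    case "us-ascii", "iso_646.irv:1991", "csascii": self = .ascii
    case "windows-1252": self = .windowsCP1252
    case "macintosh": self = .macOSRoman
    default: return nil
    }
  }

  init?(whatwg name: String) {
    switch name.lowercased() { // simplified matching; only a few labels are listed
    case "utf-8", "utf8": self = .utf8
    case "windows-1252", "us-ascii", "iso-8859-1", "latin1": self = .windowsCP1252
    case "macintosh", "x-mac-roman": self = .macOSRoman
    default: return nil
    }
  }
}

print(DemoEncoding.utf8.name(.iana) ?? "nil")
print(DemoEncoding(iana: "ISO_646.irv:1991") == .ascii)
print(DemoEncoding(whatwg: "us-ascii") == .windowsCP1252)
```

The asymmetry is deliberate in this shape: the initializers spell out the registry in their labels, while the getter selects it through the enum.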

Two points:

  1. Referring to WHATWG standards is interesting, because unlike standards bodies such as the IETF, the WHATWG make standards for the web platform only. The group's motto is "leave your sense of logic at the door", because very often its standards reflect the messy reality of needing to reconcile incompatible behaviours implemented by historical browsers rather than being a good or even logical design.

    I have been contributing to the WHATWG URL standard for many years, and I have found this point is frequently misunderstood by outsiders, who can become frustrated when attempting to use these standards in non-web contexts. They generally aren't designed for that.

    When it comes to URLs, my personal opinion is that most people probably expect other applications and libraries to process them as browsers do, so having identical behaviour with the web platform is probably reasonable. For string encodings? My feeling is that compatibility with the web platform may be more of a niche requirement that is better served by a package; I don't know, but that's how I would frame the question.

  2. Long-term, I hope String.Encoding just goes away entirely, replaced by conformances to Unicode.Encoding/_UnicodeEncoding (it's named that way because protocols couldn't be nested. It's public but DocC doesn't generate documentation for it :frowning_face:).

    One of the goals of Unicode was to replace all of these legacy encodings with a universal character set, and the standard includes a lot of redundant encodings to ensure lossless round-tripping. In my opinion, dealing with legacy text encodings is part of making Swift a great language for text processing and our embrace of Unicode.

    For String especially, Foundation has many APIs which essentially duplicate functionality from the standard library. If the standard library's interfaces need revising for Foundation to be able to use them, we should do so.


Taking as given that Foundation offers non-Unicode string encodings, what uses for those APIs would you have in mind that would be more common/less niche than Web-compatible ones?

Well, that is essentially the same question, and I already said I don't know. Legacy encodings are niche even on the web these days, to be honest.

But I can imagine a situation where Foundation provides the legacy encodings, while the actual WHATWG Encoding Standard-related stuff is left to a third-party package which provides the relevant interfaces. Why would that be insufficient?

Minor feedback and questions:

  • Why is .utf16 nested within a .unicode enum? This seems to be unique among all values of String.Encoding. In particular, why are .utf32, .utf{16,32}LittleEndian and .utf{16,32}BigEndian top-level enum cases?
  • .symbol should probably be .macSymbol to avoid confusion.
  • I would argue .macOSRoman is better known as .macRoman.
  • .nextstep should be capitalized .nextStep. :wink:
  • UCS-2 is not UTF-16. UCS-2 does not have surrogate pairs. There should be a separate .ucs2 enum, and attempting to convert a string with characters outside the BMP to .ucs2 should fail.
  • Why only specify a subset of Windows codepages? Interaction with legacy software/file formats might require using arbitrary Windows codepages.
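
To illustrate the UCS-2 point in the list above: a strict UCS-2 conversion has no surrogate pairs, so it can only represent scalars in the BMP. The sketch below is a hypothetical helper, not an existing Foundation API (there is no `.ucs2` case today), and it ignores byte order entirely.

```swift
/// Hypothetical helper: returns UCS-2 code units, or nil if any scalar lies outside the BMP.
func ucs2CodeUnits(for string: String) -> [UInt16]? {
  var units: [UInt16] = []
  for scalar in string.unicodeScalars {
    // UCS-2 cannot encode scalars above U+FFFF because it has no surrogate pairs.
    guard scalar.value <= 0xFFFF else { return nil }
    units.append(UInt16(scalar.value))
  }
  return units
}

let bmpOnly = ucs2CodeUnits(for: "A\u{3042}")   // U+0041, U+3042: both in the BMP
let withEmoji = ucs2CodeUnits(for: "\u{1F600}") // U+1F600: outside the BMP

print(bmpOnly != nil)
print(withEmoji == nil)
// UTF-16, by contrast, encodes U+1F600 as a surrogate pair (two code units).
print("\u{1F600}".utf16.count)
```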

Mmm, given that NaN is spelled .nan, it seems most consistent that NeXTSTEP is spelled .nextstep.

I put the winky face there because the capitalization of “NEXTSTEP” changed multiple times over the course of the company’s existence. :)


If you're referring to the proposed behavior when specifying init(whatwg: "ucs-2"), the Encoding Standard specifies explicitly that it should return UTF-16LE (likewise also for "unicode"):

I’m referring to the enum cases (left column of boxes whose labels have leading dots) shown in the chart.

Ah there are so many tables in the form of photos in this thread that aren’t easily inspectable: I agree that enum cases should only have the canonical names rather than compatibility labels, which can be misleading.

If we literally can’t think of a use case that isn’t Web-compatible, then the question answers itself, no?

Are you asking Karl to justify the presence of legacy encodings at all? Or just the deference to existing API semantics over the WHATWG’s compatibility table?

For what it’s worth, I’ve worked on software that has used legacy encodings to read files created on big-endian 68k Macintoshes and NeXT boxen, and I have never once referred to the WHATWG for guidance on how to handle those scenarios. The WHATWG codified existing browser behavior; they didn’t try to design an API for people who actually know what they’re requesting.


Perhaps I'm misunderstanding what @Karl is saying, but my read is that he buys the rationale for having legacy encoding support in Foundation, but thinks that library shouldn't offer a way to initialize them based on WHATWG name-to-encoding mappings because:

...to which my question was: to which other platform or use case would the legacy encodings be less of a niche requirement? Since we cannot seem to name any use cases between us, isn't it the case (by construction) that the Web platform isn't "more" niche?

But had you done so, would WHATWG mappings have helped, harmed, or made no difference (i.e., was this use case "Web-compatible" or incompatible)? In other words, is this really an argument that these APIs ought to be made more in reach, and not that they are irrelevant?

Thank you for the suggestion. However, their spellings are not my idea. They are already exposed as official API:

To focus on the name-value conversions, the (latest version of the) pitch doesn't change them, nor does it add new properties.

Legacy encodings vs WHATWG encodings...

One example that may cause issues is described in the pitch: handling an XML 1.1 file declaring that its encoding is "ISO-8859-1": <?xml version="1.1" encoding="ISO-8859-1"?>.

We can't decide which encodings are better. I think the people who should decide which to use are library users, not library authors.

Unusual casing in company and product names is a form of stylized typographic logo: a way for companies to stand out in running text in media, etc. Some companies also use glyph substitution, replacing a Latin letter with a similar-looking Cyrillic letter or other Unicode glyph to make their name stand out. It can be mirroring of letters or, as in this case, unusual casing.

We should not sneak logos into our code base, but use normal casing rules as per the English language (yes, English, as Swift otherwise uses English for keywords, etc).

It is a fine line to balance. I'm sure most people here would write iPhone, and not Iphone, when referring to the product, but also that most people would write Sony and not SONY to refer to the company. (A lot of media outlets have a strict policy of rejecting all special casing, including iPhone and nVIDIA, so as not to fall for private companies' marketing tactics of getting free exposure.)