Pitch(Foundation): String Encoding Names

YOCKOW · February 10, 2025, 8:08am

Let me share the pitch #5 here.

Moving WHATWG things to the Future Directions, APIs are narrowed down to IANA-based conversions. On the other hand, we don't aim at full compatibility with CF APIs.
I think this is down to earth in a realistic manner for the time being.

String Encoding Names

Proposal: Not assigned yet
Author(s): YOCKOW
Review Manager: TBD
Status: Pitch

Implementation: StringEncodingNameImpl/StringEncodingName.swift

Review: (Pitch)

Revision History

Pitch#1

Features
- Fully compatible with CoreFoundation.
  - Planned to add static properties corresponding to kCFStringEncoding*.
- Spelling of getter/initializer was ianaCharacterSetName.
Pros
- Easy to migrate from CoreFoundation.
Cons
- Propagating undesirable legacy conversions into current Swift Foundation.
- Including string encodings which might not be supported by Swift Foundation.

Pitch#2

Features
- Consulting both IANA Character Sets and WHATWG Encoding Standard.
  - Making a compromise between them.
- Spelling of getter/initializer was name.
Pros
- Easy to communicate with API.
Cons
- Hard for users to comprehend conversions.
- Difficult to maintain the API in a consistant way.

Pitch#3, Pitch#4

Features
- Consulting both IANA Character Sets and WHATWG Encoding Standard.
- Separated getters/initializers for them.
  - #3: charsetName and standardName respectively.
  - #4: name(.iana) and name(.whatwg) for getters; init(iana:) and init(whatwg:) for initializers.
Pros
- Users can recognize what kind of conversions is used.
Cons
- Not reflecting the fact that WHATWG's Encoding Standard doesn't provide only string encoding names but also implementations to encode/decode data.

Pitch#5

This pitch.

Introduction

This proposal allows String.Encoding to be converted to and from various names.

For example:

print(String.Encoding.utf8.name!) // Prints "UTF-8"
print(String.Encoding(name: "ISO_646.irv:1991") == .ascii) // Prints "true"

Motivation

String encoding names are widely used in computer networking and other areas. For instance, you often see them in HTTP headers such as Content-Type: text/plain; charset=UTF-8 or in XML documents with declarations such as <?xml version="1.0" encoding="Shift_JIS"?>.

Therefore, it is necessary to parse and generate such names.

Current solution

Swift lacks the necessary APIs, requiring the use of CoreFoundation (hereinafter called "CF") as described below.

extension String.Encoding {
  var nameInLegacyWay: String? {
    // 1. Convert `String.Encoding` value to the `CFStringEncoding` value.
    //    NOTE: The raw value of `String.Encoding` is the same as the value of `NSStringEncoding`,
    //          while it is not equal to the value of `CFStringEncoding`.
    let cfStrEncValue: CFStringEncoding = CFStringConvertNSStringEncodingToEncoding(self.rawValue)

    // 2. Convert it to the name where its type is `CFString?`
    let cfStrEncName: CFString? = CFStringConvertEncodingToIANACharSetName(cfStrEncValue)

    // 3. Convert `CFString` to Swift's `String`.
    //    NOTE: Unfortunately they can not be implicitly casted on Linux.
    let charsetName: String? = cfStrEncName.flatMap {
      let bufferSize = CFStringGetMaximumSizeForEncoding(
        CFStringGetLength($0),
        kCFStringEncodingASCII
      ) + 1
      let buffer = UnsafeMutablePointer<CChar>.allocate(capacity: bufferSize)
      defer {
        buffer.deallocate()
      }
      guard CFStringGetCString($0, buffer, bufferSize, kCFStringEncodingASCII) else {
        return nil
      }
      return String(utf8String: buffer)
    }
    return charsetName
  }

  init?(fromNameInLegacyWay charsetName: String) {
    // 1. Convert `String` to `CFString`
    let cfStrEncName: CFString = charsetName.withCString { (cString: UnsafePointer<CChar>) -> CFString in
      return CFStringCreateWithCString(nil, cString, kCFStringEncodingASCII)
    }

    // 2. Convert it to `CFStringEncoding`
    let cfStrEncValue: CFStringEncoding = CFStringConvertIANACharSetNameToEncoding(cfStrEncName)

    // 3. Check whether or not it's valid
    guard cfStrEncValue != kCFStringEncodingInvalidId else {
      return nil
    }

    // 4. Convert `CFStringEncoding` value to `String.Encoding` value
    self.init(rawValue: CFStringConvertEncodingToNSStringEncoding(cfStrEncValue))
  }
}

What's the problem of the current solution?

It is complicated to use multiple CF functions to get a simple value. That's not Swifty.
CF functions are legacy APIs that do not always meet modern requirements.
CF APIs are not officially intended to be called directly from Swift on non-Darwin platforms.

Proposed solution

The solution is straightforward.
We introduce a computed property that returns the name, and the initializer that creates an instance from a name as shown below.

extension String.Encoding {
  /// The name of this encoding that is compatible with the one of the IANA registry "charset".
  public var name: String?

  /// Creates an instance from the name of the IANA registry "charset".
  public init?(name: String)
}

Detailed design

This proposal refers to "Character Sets" published by IANA because CF APIs do so.
However, as mentioned above, CF APIs are sometimes out of step with the times.
Therefore, we need to adjust it to some extent:

Graph of Encodings ↔︎ Names
The graph of String.Encoding-Name conversions

`String.Encoding` to Name

Upper-case letters may be used unlike CF.
- var name returns Preferred MIME Name or Name of the encoding defined in "IANA Character Sets".

Name to `String.Encoding`

init(name:) adopts "Charset Alias Matching" defined in UTX#22.
- i.g., "u.t.f-008" is recognized as "UTF-8".
init(name:) behaves consistently about ISO-8859-*.
- For example, CF inconsistently handles "ISO-8859-1-Windows-3.1-Latin-1" and "csWindows31Latin1".
- "ISO-8859-1-Windows-3.0-Latin-1" is a subset of "windows-1252", not of "ISO-8859-1".^[1]
- "ISO-8859-1-Windows-3.1-Latin-1" is a subset of "windows-1252", not of "ISO-8859-1".^[2]
- "ISO-8859-2-Windows-Latin-2" is a subset of "windows-1250", not of "ISO-8859-2".^[3]
- "ISO-8859-9-Windows-Latin-5" is a subset of "windows-1254", not of "ISO-8859-9".^[4]

Rationales for controversial points

While "ISO_646.irv:1983"(a.k.a. "Code page 1009") is resolved into .ascii by CF, it is, strictly speaking, incompatible with "US-ASCII".
This proposal decides that String.Encoding can't be initialized from "ISO_646.irv:1983".
"CP51932" was regarded as a variant of "EUC-JP" formulated by Microsoft.
It was, however, intended to be used mainly by web browsers (i.e. Internet Explorer considering the historical background) on Windows.
As a result, it is incompatible with the original "EUC-JP" widely used on UNIX.
Consequently, "CP51932" should not be associated with .japaneseEUC.
"CP932" is no longer available for a name of any encodings. Consequently, String.Encoding.shiftJIS.name returns "Shift_JIS".
"Windows-31J" is a variant of "Shift_JIS" extended by Microsoft.
For historical reasons, String.Encoding.shiftJIS is an encoding equivalent to kCFStringEncodingDOSJapanese in CF (not to kCFStringEncodingShiftJIS), which means that .shiftJIS should be created from the name "Windows-31J" as well.

Source compatibility

These changes proposed here are only additive. However, care must be taken if migrating from CF APIs.

Implications on adoption

This feature can be freely adopted and un-adopted in source code with no deployment constraints and without affecting source compatibility.

Future directions

String.init(data:encoding:) and String.data(using:) will be implemented more appropriately^[5].

Hopefully, happening some cascades like below might be expected in the longer term.

General string decoders/encoders and their protocols (for example, as suggested in "Unicode Processing APIs") could be implemented.
Some types which provide their names and decoders/encoders could be implemented for the purpose of tightness between names and implementations.
- There would be a type for WHATWG Encoding Standard which defines both names and implementations.

They would look like...

public protocol StrawmanStringEncodingProtocol {
  static func encoding(for name: String) -> Self?
  var name: String? { get }
  var encoder: (any StringToByteStreamEncoder)? { get }
  var decoder: (any ByteStreamToUnicodeScalarsDecoder)? { get }
}

public struct IANACharset: StrawmanStringEncodingProtocol {
  public static let utf8: IANACharset = ...
  public static let shiftJIS: IANACharset = ...
  :
  :
}

public struct WHATWGEncoding: StrawmanStringEncodingProtocol {
  public static let utf8: WHATWGEncoding = ...
  public static let eucJP: WHATWGEncoding = ...
  :
  :
}

String.Encoding might be deprecated as a natural course in the distant future??

Alternatives considered

Adopting the WHATWG Encoding Standard (as well)

There is another standard for string encodings which is published by WHATWG: "Encoding Standard".
While it may claim the IANA's Character Sets could be replaced with it, it entirely focuses on Web browsers and their JavaScript APIs.
Furthermore it binds tightly names with implementations.
Since String.Encoding is just a RawRepresentable type where its RawValue is UInt, it is more universal but is more loosely bound to implementations.
As a result, WHATWG Encoding Standard doesn't easily align with String.Encoding. So it is just mentioned in "Future Directions".

Acknowledgments

Thanks to everyone who gave me advices on the pitch thread; especially to @benrimmington and @xwu who could channel their concerns into this proposal in the very early stage.