Pitch(Foundation): String Encoding Names

YOCKOW · September 16, 2024, 3:54am

The full and latest version of the proposal is on gist: NNNN-String-Encoding-Names.md · GitHub

IANA Character Set Names

Proposal: Not assigned yet
Author(s): YOCKOW
Review Manager: TBD
Status: Pitch

Implementation: apple/swift-foundation#915

Review: (Pitch)

Introduction

This proposal allows String.Encoding to be converted to and from IANA character set names.

For example:

print(String.Encoding.utf8.ianaCharacterSetName!) // Prints "utf-8"
print(String.Encoding(ianaCharacterSetName: "ISO-10646-UCS-4")! == .utf32) // Prints "true"

Motivation

The string encoding names officially published in the IANA registry are commonly used in computer networking and in other areas.
You will often find them, for example, in HTTP headers such as Content-Type: text/plain; charset=UTF-8 ("UTF-8" is the one). You will also find them in XML documents such as <?xml version="1.0" encoding="Shift_JIS" ?> ("Shift_JIS" is the one).

As a natural consequence, it is necessary to parse and generate "charset" names, for example, when you generate or receive an HTTP response.
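To make the parsing side concrete, here is a minimal sketch of pulling the charset name out of a Content-Type value before it could be handed to any name-to-encoding API. The helper name `charsetParameter(inContentType:)` is hypothetical, and the sketch assumes the simple `type/subtype; key=value` shape; it does not handle quoted values that themselves contain semicolons.

```swift
import Foundation

/// Extracts the raw `charset` parameter from a Content-Type header value.
/// Hypothetical helper for illustration only.
func charsetParameter(inContentType value: String) -> String? {
    // Skip the media type itself, then inspect each `key=value` parameter.
    for parameter in value.split(separator: ";").dropFirst() {
        let pair = parameter.split(separator: "=", maxSplits: 1)
        guard pair.count == 2 else { continue }
        // Parameter names are compared case-insensitively.
        let key = pair[0].trimmingCharacters(in: .whitespaces).lowercased()
        guard key == "charset" else { continue }
        let rawValue = pair[1].trimmingCharacters(in: .whitespaces)
        // Drop optional surrounding double quotes.
        return rawValue.trimmingCharacters(in: CharacterSet(charactersIn: "\""))
    }
    return nil
}

print(charsetParameter(inContentType: "text/plain; charset=UTF-8") ?? "none")
```

The returned string is exactly the kind of name the rest of this pitch is concerned with converting to a String.Encoding.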

Current solution

Swift lacks such APIs, so we have to use the functions defined in CoreFoundation as shown below.

`String.Encoding` → Name

let encoding: String.Encoding = ...

// 1. Convert `String.Encoding` value to the `CFStringEncoding` value.
//    NOTE: The raw value of `String.Encoding` is the same as the value of `NSStringEncoding`,
//          but it is not equal to the value of `CFStringEncoding`.
let cfStrEncValue: CFStringEncoding = CFStringConvertNSStringEncodingToEncoding(encoding.rawValue)

// 2. Convert it to the name where its type is `CFString?`
let cfStrEncName: CFString? = CFStringConvertEncodingToIANACharSetName(cfStrEncValue)

// 3. Convert `CFString` to Swift's `String`.
//    NOTE: Unfortunately they cannot be implicitly cast on Linux.
let charsetName: String? = cfStrEncName.flatMap {
  let bufferSize = CFStringGetMaximumSizeForEncoding(
    CFStringGetLength($0),
    134217984 // utf-8
  ) + 1
  let buffer = UnsafeMutablePointer<CChar>.allocate(capacity: bufferSize)
  defer {
    buffer.deallocate()
  }
  guard CFStringGetCString($0, buffer, bufferSize, 134217984) else {
    return nil
  }
  return String(utf8String: buffer)
}

Name → `String.Encoding`

let charsetName: String = ...

// 1. Convert `String` to `CFString`
let cfStrEncName: CFString = charsetName.withCString { (cString: UnsafePointer<CChar>) -> CFString in
  return CFStringCreateWithCString(nil, cString, 134217984)
}

// 2. Convert it to `CFStringEncoding`
let cfStrEncValue: CFStringEncoding = CFStringConvertIANACharSetNameToEncoding(cfStrEncName)

// 3. Check whether or not it's valid
guard cfStrEncValue != kCFStringEncodingInvalidId else {
  fatalError("Invalid Name.")
}

// 4. Convert `CFStringEncoding` value to `String.Encoding` value
let encoding = String.Encoding(rawValue: CFStringConvertEncodingToNSStringEncoding(cfStrEncValue))

What's the problem of current solution?

It is complicated to use multiple CF-functions to get a simple value. That's not Swifty.
CoreFoundation APIs are officially unavailable from Swift on non-Darwin platforms.

Proposed solution

The solution is simple.
We introduce a computed property that returns the IANA character set name and an initializer that creates an instance from a name, as shown below.

extension String.Encoding {
    public var ianaCharacterSetName: String? { get }
    public init?(ianaCharacterSetName: String)
}

Detailed design

CF Compatibility

As CF-functions such as CFStringConvertEncodingToIANACharSetName are widely used, this proposal maintains compatibility with them.
For example, String.Encoding.utf8.ianaCharacterSetName returns lowercased "utf-8" rather than "UTF-8" because CFStringConvertEncodingToIANACharSetName(0x08000100 /* kCFStringEncodingUTF8 */ ) also returns "utf-8".

Value to Name

Here is a table that shows which value is converted to the specific name.
var ianaCharacterSetName: String? { get } follows this rule.

[TABLE OMITTED: See gist]

Name to Value

Here is a table that shows which name is converted to the specific value of string encoding type.
init?(ianaCharacterSetName:) follows this rule case-insensitively.

[TABLE OMITTED: See gist]

Static properties

There are some missing static properties in String.Encoding that correspond to kCFStringEncoding*.
In order to make the values of String.Encoding initialized with init?(ianaCharacterSetName:) meaningful, we add static properties to String.Encoding:

[CODE OMITTED: See gist]

Source compatibility

The changes proposed here are purely additive and are designed to maintain compatibility with CF-functions.

Implications on adoption

This feature can be freely adopted and un-adopted in source code with no deployment constraints and without affecting source compatibility.

Future directions

One possibility is for String.init(data:encoding:) and string.data(using:) to natively support more of the string encodings added here.

Alternatives considered

Elide `iana` prefix

We could elide the iana prefix from the computed property and the label of the initializer.
However, "Character Set" has historically had a different meaning in Foundation. That is why we should add the iana prefix to CharacterSetName.

benrimmington · September 16, 2024, 5:46pm

Perhaps these are intentionally missing from the FoundationEssentials module?

CFStringBuiltInEncodings has 13 encodings, which are implemented within CoreFoundation.
CFStringEncodings has 128 encodings, which are implemented by the ICU library (as far as I can tell).
String.Encoding has the 13 built-in encodings, and 9 others which aren't currently supported by swift-foundation APIs.
String.availableStringEncodings returns 105 encodings (macOS Sequoia).

YOCKOW · September 17, 2024, 2:31am

My thought here is that:

We had better make names and instances that represent string encodings mutually convertible as far as possible (within "CF compatibility").
Deciding which encodings should be available (e.g. for String(data:encoding:)) is a "future direction".

One of my concerns is about Shift JIS (Shift_JIS) which is a common string encoding in Japan.
Surprisingly String.Encoding.shiftJIS~~, which is one of built-in encodings,~~ does NOT strictly represent Shift JIS.

As you can see in tables of my proposal:

(To name)

The value of String.Encoding.shiftJIS is 0x00000008.
The corresponding CF value is 0x00000420 which is kCFStringEncodingDOSJapanese.
Its name of IANA Character Set is "cp932"^[1], not "Shift_JIS".

(From name)

CF value converted from "Shift_JIS" is 0x00000a01 (kCFStringEncodingShiftJIS).
NS value converted from kCFStringEncodingShiftJIS is 0x80000a01, which is not defined as a static property of String.Encoding.
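
The raw values quoted above can be checked with plain arithmetic. This is a sketch, not CoreFoundation's code: it assumes that a CF encoding with no dedicated NSStringEncoding constant is represented by setting the high bit of a 32-bit value, which is consistent with the 0x80000a01 figure given here.

```swift
import Foundation

// Value quoted in the post above.
let cfShiftJIS: UInt = 0x0A01 // kCFStringEncodingShiftJIS

// Assumption: CF encodings without an NSStringEncoding constant are
// exposed by OR-ing in the high bit of a 32-bit value.
let externalFlag: UInt = 0x8000_0000
let nsShiftJIS = externalFlag | cfShiftJIS

print(String(nsShiftJIS, radix: 16))      // 80000a01
print(String.Encoding.shiftJIS.rawValue)  // 8

// The two raw values differ, so the instance built from the CF value
// is not the existing `.shiftJIS` constant.
print(String.Encoding(rawValue: nsShiftJIS) == .shiftJIS) // false
```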

I mean "Shift JIS" would be a lost child (or an unnamed encoding) at this rate.
That's why this proposal contains additions of static properties.

CP932 is based on Shift JIS, but is extended by Microsoft. It's not the same as Shift JIS. ↩︎

benrimmington · September 17, 2024, 3:47pm

The term charset is preferred by RFC2978, so ianaCharsetName might be the correct spelling. But some names (e.g. "x-nextstep") aren't registered with IANA.

ICU and UTS #22 also ignore punctuation, spaces, and leading zeros.

I'm not sure that static properties should be added yet. If possible:

FoundationEssentials would have static properties only for built-in encodings.
FoundationInternationalization might have static properties for ICU encodings, or they might only be available dynamically?
When both modules are imported, existing APIs would support additional encodings.

jmschonfeld · September 17, 2024, 5:04pm

It's great to see a pitch for this API! Definitely something that's been missing from String compared to CFString that would be very useful as you mentioned.

I tend to agree with this approach. For some background, the reason that CoreFoundation provides so many more encodings than Foundation is because CoreFoundation was implemented to call into Carbon on macOS for older applications which provides encoding conversions for many of these encodings. Foundation itself only supports a small subset of these encodings which I believe is likely roughly equal to the encodings defined as NSStringEncoding/String.Encoding constants today. I think it might be a bit confusing if String.Encoding were to have all of these encodings, but Foundation APIs that used encodings never supported them.

For some common encodings, we could look at lowering them to FoundationEssentials or adding new String.Encoding options if we deem it's important enough for Foundation to implement the conversion (for example, in Swift 6 we lowered isoLatin1 and macOSRoman encoding conversion from Foundation to FoundationEssentials since those encoding conversions were commonly used by URL-related APIs). In the fullness of time, I'd love to see an approach that @benrimmington mentioned where String.Encoding in FoundationEssentials only has encoding conversions provided by FoundationEssentials, but FoundationInternationalization can add on additional ICU encodings.

@YOCKOW do you think these APIs would be valuable with just the String.Encodings that are available today / without adding new static properties since that's what Foundation supports converting today?

YOCKOW · September 18, 2024, 6:24am

I’ll add a description about it in “alternatives considered” section.
May we elide ‘iana’ prefix if we adopt the term “charset”?

Thank you for pointing this out.
Indeed CFStringConvertIANACharSetNameToEncoding follows that.
I’ll fix the description of the proposal and its implementation.

That depends on a range of “available”.

Please forgive me for insisting on significance
of Shift JIS again.

In short;

String.Encoding(ianaCharacterSetName: "Shift_JIS") should not return nil.

Rationale

First of all, 15% of web sites in Japan still use Shift JIS. (UTF-8 is 80%.) ^[1]

Hence, we can’t ignore Shift JIS so far (unless we ignore Japan).

While String(data: data, encoding: .shiftJIS) can decode characters from Shift-JIS-encoded data, we will have trouble if the APIs proposed here are limited to just the String.Encodings that are available today.
If the server sends charset=Shift_JIS, we can’t determine the string encoding, because String.Encoding.shiftJIS corresponds to CP932 as noted above and String.Encoding(ianaCharacterSetName: "Shift_JIS") would return nil under such an implementation, even though Shift JIS can be regarded as a subset of CP932.

Conclusion?

Even if we don’t add new static properties, some non-nil value should be returned so that we may know the encoding is subset/superset of the already-available encoding.
…I think adding some static properties (which are compatible with encodings available currently?) to FoundationInternationalization is practical.

(Sorry, written in Japanese) HTMLの文字コードシェア調査ー上場企業3,600社トップページのcharsetを調べてみた – 名古屋のWebシステム開発　iNet Solutions ↩︎

xwu · September 18, 2024, 12:11pm

Do they really use standardized Shift JIS, or do they really use the version with Microsoft's vendor extensions which is CP932 (Windows-31J)? Your source doesn't say, but it does say that Shift JIS was widely used because of the ubiquity of Windows.

Edit: Wikipedia gives the answer—"many people and software packages, including Microsoft libraries, declare the Shift JIS encoding for Windows-31J data, although it includes some additional characters, and some of the existing characters are mapped to Unicode differently. This has led the WHATWG HTML standard to treat the encoding labels shift_jis and windows-31j interchangeably, and use the Windows variant for its 'Shift_JIS' encoder and decoder."

Therefore, the existing behavior is the more interoperable one, and it would be confusing if any added APIs didn't preserve it.

Edit 2: Based on the sources given, it sounds like the WHATWG has a new Encoding Standard (https://encoding.spec.whatwg.org/) which deliberately diverges from IANA to address issues of interoperability between user agents, and the language of the standard reads as though it's deliberately superseding IANA (including abolishing any reliance on its registry and removing charset extensibility for security reasons) for the purposes of the Web. A Rust implementation details the specific differences as follows—

In some cases, the Encoding Standard specifies the popular unextended encoding name where in IANA terms one of the other labels would be more precise considering the extensions that the Encoding Standard has unified into the encoding.

Encoding	IANA
Big5	Big5-HKSCS
EUC-KR	windows-949
Shift_JIS	windows-31j
x-mac-cyrillic	x-mac-ukrainian

In other cases where the Encoding Standard unifies unextended and extended variants of an encoding, the encoding gets the name of the extended variant.

IANA	Unified into Encoding
ISO-8859-1	windows-1252
ISO-8859-9	windows-1254
TIS-620	windows-874

Since the relevant Web standard seems to have moved past IANA charsets, shouldn't we be implementing the superseding standard, particularly given that existing Swift APIs already align with it?

Edit 3: This superseding standard also has the salutary effect that it uses the term "encoding" rather than "charset"—which already aligns with Swift usage. Indeed, the more I reflect on it, the more it seems that String.Encoding behavior has been deliberately designed to reflect the current Encoding Standard.

benrimmington · September 18, 2024, 1:45pm

~~Perhaps those algorithms can be implemented in FoundationEssentials, to support the existing String.Encoding.shiftJIS on all platforms.~~

(Or maybe not, the index-jis0208.txt file is too large.)

Another alternative is init(name:) and var name.

I'd simplify the "current solution" to:

extension String.Encoding {

  init?(name: String) {
    let encoding = CFStringConvertIANACharSetNameToEncoding(name as CFString)
    guard encoding != kCFStringEncodingInvalidId else { return nil }
    self.init(rawValue: CFStringConvertEncodingToNSStringEncoding(encoding))
  }

  var name: String? {
    let encoding = CFStringConvertNSStringEncodingToEncoding(self.rawValue)
    guard encoding != kCFStringEncodingInvalidId else { return nil }
    return CFStringConvertEncodingToIANACharSetName(encoding) as? String
  }
}

(And helper methods for Linux, to replace the CFString bridging.)

YOCKOW · September 19, 2024, 1:02am

You’re right. Any Shift JIS data can be decoded as CP932(Windows-31J).
On the other hand, the point is that there would be problems parsing the name Shift_JIS if “available encodings” are limited to the current String.Encodings.

For example, imagine accessing https://www.nintendo.co.jp/ngc/index.html.
(Note: Nintendo is a famous game company in Japan.)

You can find <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=Shift_JIS"> in HTML of that page.
Parse Shift_JIS as a charset name to decode the characters of the HTML.
There is no instance of current String.Encoding corresponding to Shift_JIS (if following current CF implementation).
You might not detect the correct encoding of the page.

One possible solution is giving up CF compatibility.
I mean it is ok that parsing Shift_JIS returns String.Encoding.shiftJIS.
I wonder how many people care about CF compatibility.

I wanted to emphasize the complexity of use of CF functions.

Anyway, there seem to be two controversial points so far.

What spelling is appropriate for the property name (and the label of the initializer)?
Which "charset" names should be available?

I will rewrite the proposal and implementation in several days (or several weeks).

xwu · September 19, 2024, 3:50am

If CF's behavior is not consistent with the Encoding Standard, I would argue that we must not have that propagate into modern Swift APIs.

YOCKOW · September 26, 2024, 6:48am

I've revised the pitch: NNNN-String-Encoding-Names.md · GitHub

The proposal now refers to "The Encoding Standard" as well as "IANA Character Sets".
I mean that's a compromise, because I noticed it was difficult to make "The Encoding Standard" alone fit into the current Swift APIs.
The plan described in "Alternatives considered" section might be preferred, though.

What's changed?

Renamed the title: "String Encoding Names"
Changed the strategy: Referring to IANA & WHATWG; CF compatibility is on a best-effort basis. No static properties are added.

String Encoding Names

Proposal: Not assigned yet
Author(s): YOCKOW
Review Manager: TBD
Status: Pitch

Implementation: Coming Soon

Review: (Pitch)

Introduction

This proposal allows String.Encoding to be converted to and from names.

For example:

print(String.Encoding.utf8.name!) // Prints "UTF-8"
print(String.Encoding(name: "ISO-10646-UCS-4")! == .utf32) // Prints "true"

Motivation

The names of string encodings are commonly used in computer networking and in other areas.
You will often find them, for instance, in HTTP headers such as Content-Type: text/plain; charset=UTF-8 ("UTF-8" is the one). You will also find them in XML documents such as <?xml version="1.0" encoding="Shift_JIS" ?> ("Shift_JIS" is the one).

As a natural consequence, it is necessary to parse and generate such names, for example, when you generate or receive an HTTP response.

Current solution

Swift lacks such APIs, so we have to use the functions defined in CoreFoundation (hereinafter called "CF") as shown below.

extension String.Encoding {
  var nameInLegacyWay: String? {
    // 1. Convert `String.Encoding` value to the `CFStringEncoding` value.
    //    NOTE: The raw value of `String.Encoding` is the same as the value of `NSStringEncoding`,
    //          but it is not equal to the value of `CFStringEncoding`.
    let cfStrEncValue: CFStringEncoding = CFStringConvertNSStringEncodingToEncoding(self.rawValue)

    // 2. Convert it to the name where its type is `CFString?`
    let cfStrEncName: CFString? = CFStringConvertEncodingToIANACharSetName(cfStrEncValue)

    // 3. Convert `CFString` to Swift's `String`.
    //    NOTE: Unfortunately they cannot be implicitly cast on Linux.
    let charsetName: String? = cfStrEncName.flatMap {
      let bufferSize = CFStringGetMaximumSizeForEncoding(
        CFStringGetLength($0),
        kCFStringEncodingASCII
      ) + 1
      let buffer = UnsafeMutablePointer<CChar>.allocate(capacity: bufferSize)
      defer {
        buffer.deallocate()
      }
      guard CFStringGetCString($0, buffer, bufferSize, kCFStringEncodingASCII) else {
        return nil
      }
      return String(utf8String: buffer)
    }
    return charsetName
  }

  init?(fromNameInLegacyWay charsetName: String) {
    // 1. Convert `String` to `CFString`
    let cfStrEncName: CFString = charsetName.withCString { (cString: UnsafePointer<CChar>) -> CFString in
      return CFStringCreateWithCString(nil, cString, kCFStringEncodingASCII)
    }

    // 2. Convert it to `CFStringEncoding`
    let cfStrEncValue: CFStringEncoding = CFStringConvertIANACharSetNameToEncoding(cfStrEncName)

    // 3. Check whether or not it's valid
    guard cfStrEncValue != kCFStringEncodingInvalidId else {
      return nil
    }

    // 4. Convert `CFStringEncoding` value to `String.Encoding` value
    self.init(rawValue: CFStringConvertEncodingToNSStringEncoding(cfStrEncValue))
  }
}

What's the problem of the current solution?

It is complicated to use multiple CF-functions to get a simple value. That's not Swifty.
CF-functions are legacy APIs that sometimes do not keep up with the times.
CF APIs are officially unavailable from Swift on non-Darwin platforms.

Proposed solution

The solution is simple.
We introduce a computed property that returns the name and an initializer that creates an instance from a name, as shown below.

extension String.Encoding {
    public var name: String? { get }
    public init?(name: String)
}

Detailed design

This proposal refers to "Character Sets" published by IANA and to "The Encoding Standard" published by WHATWG.
While the latter may claim the former could be replaced with it, it focuses on Web browsers (and their JavaScript APIs).
Since Swift APIs should be a little more universal ^[1], here we consult both.

How to compromise

Not to be too arbitrary, this proposal stipulates these principles:

Focus on the encodings that are available publicly on swift-foundation at this point.
- Exception: Consider compatibility for possible additions (or exposure) of ISO-8859-* encodings in the future. ^[2]
Treat encodings as different ones if their rawValues differ.
Keep affordance: String.Encoding(name: someEncoding.name!) == someEncoding is supposed to be true.
CF compatibility is on a best-effort basis.

`String.Encoding` to Name

Here is a table that shows the proposed names corresponding to each encoding.

`String.Encoding`	CF Name Output	IANA	WHATWG	Proposed Name
`.ascii`	us-ascii	US-ASCII	windows-1252	US-ASCII
`.nextstep`	x-nextstep	n/a	x-user-defined	x-nextstep
`.japaneseEUC`	euc-jp	EUC-JP	EUC-JP	EUC-JP
`.utf8`	utf-8	UTF-8	UTF-8	UTF-8
`.isoLatin1`	iso-8859-1	ISO-8859-1	windows-1252	ISO-8859-1
`.symbol`	x-mac-symbol	n/a	x-user-defined	x-mac-symbol
`.nonLossyASCII`	n/a	n/a	n/a	n/a
`.shiftJIS`	cp932	n/a	n/a	Shift_JIS
`.isoLatin2`	iso-8859-2	ISO-8859-2	ISO-8859-2	ISO-8859-2
`.unicode`	utf-16	UTF-16	UTF-16LE	UTF-16
`.windowsCP1251`	windows-1251	windows-1251	windows-1251	windows-1251
`.windowsCP1252`	windows-1252	windows-1252	windows-1252	windows-1252
`.windowsCP1253`	windows-1253	windows-1253	windows-1253	windows-1253
`.windowsCP1254`	windows-1254	windows-1254	windows-1254	windows-1254
`.windowsCP1250`	windows-1250	windows-1250	windows-1250	windows-1250
`.iso2022JP`	iso-2022-jp	ISO-2022-JP	ISO-2022-JP	ISO-2022-JP
`.macOSRoman`	macintosh	macintosh	macintosh	macintosh
`.utf16BigEndian`	utf-16be	UTF-16BE	UTF-16BE	UTF-16BE
`.utf16LittleEndian`	utf-16le	UTF-16LE	UTF-16LE	UTF-16LE
`.utf32`	utf-32	UTF-32	n/a	UTF-32
`.utf32BigEndian`	utf-32be	UTF-32BE	n/a	UTF-32BE
`.utf32LittleEndian`	utf-32le	UTF-32LE	n/a	UTF-32LE

What's changed from legacy names?

Upper-case letters are used if desirable.
String.Encoding.shiftJIS.name returns "Shift_JIS" since "CP932" is no longer available for a name of any encodings.
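
The round-trip principle stated earlier (String.Encoding(name: someEncoding.name!) == someEncoding) can be checked mechanically against the table. The following is a minimal sketch over a handful of rows; `sketchNames`, `sketchName(of:)`, and `sketchEncoding(named:)` are hypothetical stand-ins for the proposed property and initializer, not the actual implementation.

```swift
import Foundation

// A few rows from the table above (encoding, proposed name).
// Hypothetical stand-in data, not the real implementation.
let sketchNames: [(encoding: String.Encoding, name: String)] = [
    (.ascii, "US-ASCII"),
    (.utf8, "UTF-8"),
    (.shiftJIS, "Shift_JIS"),
    (.isoLatin1, "ISO-8859-1"),
    (.windowsCP1252, "windows-1252"),
    (.utf16BigEndian, "UTF-16BE"),
]

/// Stand-in for the proposed `name` property.
func sketchName(of encoding: String.Encoding) -> String? {
    sketchNames.first { $0.encoding == encoding }?.name
}

/// Stand-in for the proposed `init?(name:)`, matching case-insensitively.
func sketchEncoding(named name: String) -> String.Encoding? {
    let key = name.lowercased()
    return sketchNames.first { $0.name.lowercased() == key }?.encoding
}

// Round trip: every encoding that has a name must be recoverable from it.
let roundTrips = sketchNames.allSatisfy { row in
    sketchName(of: row.encoding).flatMap(sketchEncoding(named:)) == row.encoding
}
print(roundTrips) // true
```

The check only passes when no two encodings share a proposed name, which is why distinct rawValues need distinct names under these principles.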

Name to `String.Encoding`

This proposal provides a rule for converting names to String.Encodings in order to conform to the aforementioned principles.
At first glance, it may look convoluted, but the results obtained from the rule are easy to understand.

The Rule

Definitions

'input matches IANA charset "foo"' means that the given input matches the name or one of the aliases of IANA charset specified by "foo" using "Charset Alias Matching" method.
'input matches WHATWG encoding "bar"' means that the given input matches one of the labels of WHATWG encoding specified by "bar" case-insensitively.
'input is "baz"' means that the given input matches case-insensitively "baz".
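
For readers unfamiliar with "Charset Alias Matching", here is a minimal sketch of the normalization as this example reads UTS #22: keep only ASCII letters and digits, lowercase the letters, and drop each "0" that is not preceded by a digit. The function name `charsetAliasKey` is hypothetical; two names are considered to match when their keys are equal.

```swift
/// Builds a comparison key in the style of UTS #22 "Charset Alias Matching".
/// Hypothetical helper; a sketch of one reading of the rule, not ICU's code.
func charsetAliasKey(_ name: String) -> String {
    var key = ""
    var previousKeptIsDigit = false
    for scalar in name.unicodeScalars {
        switch scalar {
        case "0":
            // A zero survives only directly after another kept digit.
            if previousKeptIsDigit { key.unicodeScalars.append(scalar) }
        case "1"..."9":
            key.unicodeScalars.append(scalar)
            previousKeptIsDigit = true
        case "a"..."z":
            key.unicodeScalars.append(scalar)
            previousKeptIsDigit = false
        case "A"..."Z":
            // Map ASCII uppercase to lowercase.
            key.unicodeScalars.append(Unicode.Scalar(scalar.value + 32)!)
            previousKeptIsDigit = false
        default:
            continue // punctuation, spaces, and non-ASCII are ignored
        }
    }
    return key
}

print(charsetAliasKey("UTF-8"))     // utf8
print(charsetAliasKey("u.t.f-008")) // utf8
print(charsetAliasKey("utf-80"))    // utf80
```

Under this reading "UTF-8" and "u.t.f-008" match each other, while "utf-80" does not match either.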

Procedures

If input matches IANA charset "US-ASCII" or "ISO_646.irv:1983", return .ascii.
If input is "ascii", return .ascii.
If input matches IANA charset "ISO-8859-1", return .isoLatin1.
If input matches IANA charset "ISO-8859-9", return nil (or it should be .isoLatin5 if exists).
If input matches IANA charset "TIS-620", return nil (or it should be .isoLatinThai if exists).
If input matches IANA charset "ISO-10646-UCS-2", "UNICODE-1-1", or "UTF-16", return .unicode.
If input is "ucs-2" or "unicode", return .unicode.
If input matches one of WHATWG encodings below (except already matched above), return a corresponding String.Encoding instance :
- "EUC-JP" → .japaneseEUC
- "UTF-8" → .utf8
- "Shift_JIS" → .shiftJIS
- "ISO-8859-2" → .isoLatin2
- "windows-1251" → .windowsCP1251
- "windows-1252" → .windowsCP1252
- "windows-1253" → .windowsCP1253
- "windows-1254" → .windowsCP1254
- "windows-1250" → .windowsCP1250
- "ISO-2022-JP" → .iso2022JP
- "macintosh" → .macOSRoman
- "UTF-16BE" → .utf16BigEndian
- "UTF-16LE" → .utf16LittleEndian
If input matches one of IANA charsets below, return a corresponding String.Encoding instance:
- "CP51932" → .japaneseEUC
- "EUC-JP" → .japaneseEUC
- "UTF-8" → .utf8
- "Adobe-Symbol-Encoding" → .symbol
- "Windows-31J" → .shiftJIS
- "ISO-8859-2" → .isoLatin2
- "windows-1251" → .windowsCP1251
- "windows-1252" → .windowsCP1252
- "ISO-8859-1-Windows-3.0-Latin-1" → .windowsCP1252
- "ISO-8859-1-Windows-3.1-Latin-1" → .windowsCP1252
- "windows-1253" → .windowsCP1253
- "windows-1254" → .windowsCP1254
- "ISO-8859-9-Windows-Latin-5" → .windowsCP1254
- "windows-1250" → .windowsCP1250
- "ISO-8859-2-Windows-Latin-2" → .windowsCP1250
- "ISO-2022-JP" → .iso2022JP
- "macintosh" → .macOSRoman
- "UTF-32" → .utf32
- "ISO-10646-UCS-4" → .utf32
- "UTF-16BE" → .utf16BigEndian
- "UTF-16LE" → .utf16LittleEndian
- "UTF-32BE" → .utf32BigEndian
- "UTF-32LE" → .utf32LittleEndian
If input is "x-nextstep", return .nextstep.
If input is "x-mac-symbol", return .symbol.
If input matches none of the above, return nil.

NOTE: Actual implementation may contain some "by-passes" for the purpose of performance optimization.
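
As a small illustration, the steps of the form 'input is "…"' reduce to a case-insensitive exact comparison. The sketch below covers only those steps; the IANA and WHATWG matching steps are deliberately left out, and `encodingForExactName` is a hypothetical name.

```swift
import Foundation

/// Covers only the `input is "…"` steps of the procedure above.
/// Hypothetical helper; all other steps are omitted from this sketch.
func encodingForExactName(_ input: String) -> String.Encoding? {
    switch input.lowercased() {
    case "ascii": return .ascii
    case "ucs-2", "unicode": return .unicode
    case "x-nextstep": return .nextstep
    case "x-mac-symbol": return .symbol
    default: return nil // would fall through to the remaining steps
    }
}

print(encodingForExactName("ASCII") == .ascii) // true
```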

Mapping

Here is a table that shows the results of the rule as mentioned above:

[TABLE OMITTED; See gist]

What's changed from legacy API?

Some names that are not supported by CF are available conforming to latest standard. And vice versa.
String.Encoding(name: "Shift_JIS") returns .shiftJIS.
Inconsistency about ISO-8859-* is fixed.
- For example: "ISO-8859-1-Windows-3.1-Latin-1" vs "csWindows31Latin1"

Source compatibility

These changes proposed here are only additive. However, care must be taken if migrating from CF APIs.

Implications on adoption

This feature can be freely adopted and un-adopted in source code with no deployment constraints and without affecting source compatibility.

Future directions

More encodings and their names may become available on swift-foundation.

Alternatives considered

Don't compromise

We may be able to provide the computed properties and the initializers separately for IANA charsets and WHATWG encodings.

That is:

extension String.Encoding {
    /// Returns the name of IANA charset.
    var charsetName: String? { get }

    /// Returns the name of WHATWG encoding.
    var standardName: String? { get }

    /// Creates an instance from the name of IANA charset.
    init?(charsetName: String)

    /// Creates an instance from the name of WHATWG encoding.
    init?(standardName: String)
}

Acknowledgments

Thanks to everyone who gave me advice on the pitch thread; especially to @benrimmington and @xwu.

You may just want to parse an old XML document locally. ↩︎
The Encoding Standard defines ISO-8859-1, ISO-8859-9, and ISO-8859-11 as mere labels of other encodings. That doesn't fit into even current Swift APIs. ↩︎

xwu · September 26, 2024, 1:39pm

The Encoding Standard is the compromise between the prior standard and actual implementations.

We really should not invent a new compromise of the compromise: as the standard explains, there are real security concerns that can be exploited when a document is decoded with more or less laxity than is specified.

IMO, being “more universal” should not only not be a goal: it should be an anti-goal. Refusing to return an encoding at all when there is ambiguity, even trapping at runtime, is more consistent with safety and security goals.

Put another way, if there is an irreconcilable conflict between what Swift already supports and the most current standard, I would urge any new API to represent the intersection of those existing Swift APIs and the Encoding Standard, not the union.

YOCKOW · September 27, 2024, 8:17am

I admit The Encoding Standard is a standard; but I'd rather say it is one of standards.
IANA registry is still alive as, in fact, the list of charsets was updated recently, in June 2024.

Depending on only The Encoding Standard may induce, for example, the "ISO-8859-1 vs windows-1252" issue to recur.
The Encoding Standard defines ISO-8859-1 as a mere label of windows-1252.
That is not because of any security issue, but because of the infamous historical reason that Microsoft gave Windows-1252 a misnomer "ISO-8859-1"(or "ANSI").
(Note that ISO-8859-1 is the default encoding of HTTP/1.1^[1] and HTTP/1.1 is (unfortunately) still used to a certain extent^[2].)

Looking back at the current Swift APIs, the rawValue of .isoLatin1 (which represents ISO-8859-1) is 0x05 while that of .windowsCP1252 is 0x0C.
That is to say, Swift Foundation distinguishes ISO-8859-1 from windows-1252 (as IANA does).
In practical terms, ISO-8859-1 and windows-1252 are not compatible for some code points; Swift behaves as such:

import Foundation

let data = Data([0x85])
let decodedAsISO8859_1 = String(data: data, encoding: .isoLatin1)!
let decodedAsWindows1252 = String(data: data, encoding: .windowsCP1252)!

print(decodedAsISO8859_1 == "\u{0085}") // Prints "true"
print(decodedAsWindows1252 == "\u{2026}") // Prints "true"
print(decodedAsISO8859_1 == decodedAsWindows1252) // Prints "false", of course

This is definitely "an irreconcilable conflict between what Swift already supports and the most current standard" if the most current standard means The Encoding Standard.
This is not so if the most current standard means IANA Character Sets.

As for HTML, WHATWG has certainly superseded W3C's standard.
What about others?
For example, CSS is still maintained by W3C. And Mozilla, one of members of WHATWG, also says @charset syntax in CSS must use the name of character encoding defined in the IANA-registry^[3].
IANA registry is still alive even for WHATWG in such cases.

Now the point of the argument is which is an appropriate standard for string encodings.
We might not answer that unconditionally.

I'm tending to think that the "alternative" could be more suitable so that users can decide which should be used in each case:

extension String.Encoding {
    /// Returns the name of IANA charset.
    var charsetName: String? { get }

    /// Returns the name of WHATWG encoding.
    var standardName: String? { get }

    /// Creates an instance from the name of IANA charset.
    init?(charsetName: String)

    /// Creates an instance from the name of WHATWG encoding.
    init?(standardName: String)
}

xwu · September 27, 2024, 3:55pm

FYI, RFC 7231 removed ISO-8859-1 as the default encoding of HTTP/1.1

The CSS standard only requires that user agents support UTF-8 encoding. In general, the control characters that ISO-8859-1 maps to but not Windows 1252 have no use on the web, so the Encoding Standard is correct in codifying standard practice that considers "ISO-8859-1" to be Windows 1252. If you can find any modern browser that does not reflect this behavior when it handles CSS files, I'd be quite surprised.

This doesn't mean that the Encoding Standard is suitable for all possible clients of Foundation, since the web isn't everything, but it also doesn't mean that those usages are general purpose enough to merit a dedicated, duplicate set of APIs which need to track a constantly updated registry.

YOCKOW · September 28, 2024, 12:05am

Thank you.
My brain was living in the ancient age.

The next controversial point may be how big the merit is, I guess.

My excuse
Although I was nagging (sorry), my firm belief is "DO NOT USE OTHER THAN UTF-8".
I know almost all of textual files in the world are encoded with UTF-8 nowadays.
But, ...therefore, when you encounter non-UTF-8 files, they must be old-fashioned. That is, old-style approaches may be required.
We would have to be able to know whether or not they are necessary.

At any rate, I have to revise the proposal again.

benrimmington · September 30, 2024, 1:04am

Two (deprecated?) encodings could be removed from the proposal:

.nonLossyASCII is only implemented by NSString APIs.
.unicode might be replaced by its .utf16 alias, to align with the proposed "UTF-16" name (and the related .utf16BigEndian and .utf16LittleEndian encodings).

Could unsupported rows be removed from the second table? (854 out of 971 rows don't contain a proposed encoding.)

Both tables contain x-user-defined, which seems unrelated to x-mac-symbol and x-nextstep.

YOCKOW · September 30, 2024, 5:30am

That's my fault. My script to generate the tables seems to have some bugs and I was so lazy that I had copy-and-pasted them. I'll fix, adjust, and compact them (if tables will be necessary in the first place).

As a personal side note: We may have to consider The Encoding Standard always assumes UTF-16 is little endian.

benrimmington · September 30, 2024, 3:46pm

When decoding, if there isn't a byte-order mark (BOM) then String.Encoding.utf16 assumes big-endian.
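
A quick way to observe this on a given platform is to decode the same two bytes with and without an explicit byte order. This is a sketch for experimentation; the last line prints whatever the local Foundation does with BOM-less input rather than asserting a result.

```swift
import Foundation

let bigEndianA = Data([0x00, 0x41])    // "A" in UTF-16BE, no BOM
let littleEndianA = Data([0x41, 0x00]) // "A" in UTF-16LE, no BOM

// Explicit byte orders are unambiguous.
print(String(data: bigEndianA, encoding: .utf16BigEndian) == "A")       // true
print(String(data: littleEndianA, encoding: .utf16LittleEndian) == "A") // true

// No BOM: the result shows which byte order `.utf16` assumed here.
print(String(data: bigEndianA, encoding: .utf16) as Any)
```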

jmschonfeld · October 4, 2024, 10:04pm

Sorry for the delay, catching up a bit on this thread...

I tend to agree with these sentiments. In general Foundation has found it problematic to maintain APIs that perform loose/fuzzy matching and while we don't need to be perfectly rigid in general I think we should try to avoid accepting ambiguity and be as clear as we can to avoid unexpected behavior.

To that end, the alternatives considered section mentions the following API:

extension String.Encoding {
    /// Returns the name of IANA charset.
    var charsetName: String? { get }

    /// Returns the name of WHATWG encoding.
    var standardName: String? { get }

    /// Creates an instance from the name of IANA charset.
    init?(charsetName: String)

    /// Creates an instance from the name of WHATWG encoding.
    init?(standardName: String)
}

I wonder if this merits more discussion if the selection of a specific naming scheme becomes too problematic. We already have a precedent for this with Locale identifiers, for example: identifier(_:) | Apple Developer Documentation. This allows you to take a given Locale and produce an identifier of varying specifications (bcp47, icu, cldr). The initializer only accepts a specific standard, but Locale also provides a way to convert between the standards when possible. Given this precedent for Locale, I think maybe it makes sense to investigate this for the Encoding identifiers since I feel they might fall into the same boat. @itingliu do you have any thoughts on this since you've worked more closely with those Locale APIs?
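
To picture what a Locale-style shape might look like for encodings, here is a purely hypothetical sketch in which the naming standard is passed as a parameter. `EncodingNameStandard` and `sketchName(_:)` are invented for illustration, and the tiny table only reflects names mentioned in this thread.

```swift
import Foundation

/// Hypothetical; loosely mirrors the role of `Locale.IdentifierType`.
enum EncodingNameStandard { case iana, whatwg }

extension String.Encoding {
    /// Hypothetical sketch; the table is tiny and illustrative only.
    func sketchName(_ standard: EncodingNameStandard) -> String? {
        switch self {
        case .utf8:
            return "UTF-8"
        case .windowsCP1252:
            return "windows-1252"
        case .isoLatin1:
            // In the Encoding Standard, ISO-8859-1 is only a label of windows-1252.
            return standard == .iana ? "ISO-8859-1" : nil
        case .shiftJIS:
            // The existing constant corresponds to CP932 (Windows-31J).
            return standard == .iana ? "Windows-31J" : "Shift_JIS"
        default:
            return nil
        }
    }
}

print(String.Encoding.shiftJIS.sketchName(.whatwg) ?? "nil") // Shift_JIS
```

Returning nil where a standard has no distinct name for an encoding keeps the ambiguity visible to the caller instead of resolving it silently.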

YOCKOW · October 6, 2024, 3:50am

@jmschonfeld Thank you for follow-up and bringing up the precedent.

As a basis for discussion, I think conflicts are more likely to occur with the initializer for String.Encoding (unlike Locale?).
The following tables may illustrate this:

Encoding	IANA
Big5	Big5-HKSCS
EUC-KR	windows-949
Shift_JIS	windows-31j
x-mac-cyrillic	x-mac-ukrainian

IANA	Unified into Encoding
ISO-8859-1	windows-1252
ISO-8859-9	windows-1254
TIS-620	windows-874
