Pitch(Foundation): String Encoding Names

So, yes - legacy encodings are useful, even if we don't have any name parsing.

I find it difficult to parse the rest of your question. Anything that processes legacy text would, of course, benefit from support for legacy text encodings; but many of those don't need the web-compatible names, and in fact the proposal includes an example of when using the web-compatible name would be incorrect (XML documents).

And by the way, I'm not saying it isn't worth adding. I just think it's interesting and that we should consider to what extent we want Foundation APIs to formally reflect standards designed for the web.


I agree the extent to which Foundation should reflect standards is still an open question. But we do continue to add new API to support particular standards. For example, the calendar recurrence rule was designed to work with RFC 5545, and the HTTP date format was designed specifically for HTTP.


I'm not sure what should be formal, but now I think we have some good reasons for having APIs for IANA and WHATWG at least.

  • IANA is a standards organization that has been in existence for over 3 decades (owned by ICANN now though) and is in fact responsible for maintaining "charset" names. CF is based on IANA.
  • WHATWG provides de-facto standards that follow real-world practice (web browsers). :thought_balloon:Also, the fact that one of the members of WHATWG is Apple may be important?

Anyway, we should leave NameType non-@frozen for some unseen standards in the future, maybe.

I think this proposal needs to more deeply consider the underlying encoding implementation of each encoding and also separate names from labels. E.g., while the WHATWG Encoding standard maps the "ascii" label to "windows-1252", there's no encoding whose name is "ascii" (or "ASCII").

And also what the stability guarantees are of that encoding implementation. E.g., much of the industry had to change the encoding WHATWG calls "gb18030" due to GB18030-2022 recently being issued. Are such changes possible in Foundation or would that warrant a new encoding implementation of sorts (and thus a new name)?

As another example: while both IANA and WHATWG define Shift_JIS, the intended meaning is quite a bit different.

Also, IANA tends to be very light on requirements of the actual encoding implementation, while WHATWG requires exact handling for any given input, including input that can be considered erroneous. Do the Foundation encoding implementations meet these requirements? If not, using WHATWG names or labels to identify them would be rather misleading.

I hope this helps you refine the proposal further.


I'm surprised and glad that the author of Encoding Standard took a look at this pitch. Thank you so much.

I think, in general, APIs of this kind inevitably involve some compromise.
I mean, any standard gets converted into "Foundation's standard" to some extent through such APIs.
In fact, even the current CFStringConvertIANACharSetNameToEncoding(_:) returns the CFStringEncoding value that is the closest mapping, according to its documentation.


Do you mean String.Encoding(whatwg: "ascii") should return nil?
Or should we explicitly describe exactly that it's a label such as String.Encoding(whatwgLabel: "ascii")?
(In case of WebKit, for example, PAL::TextEncoding's constructor takes a string-like object and is constructed as "windows-1252" when "ascii" is passed[1], IIUC.)
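
To make the name/label distinction concrete, here is a minimal sketch of label resolution in the WHATWG style: ASCII whitespace is trimmed, the label is matched ASCII case-insensitively, and the result is an encoding name that may differ from the label. The function `whatwgEncodingName(forLabel:)` and the table are hypothetical and list only a handful of labels; this is neither WebKit's nor Foundation's implementation.

```swift
// Hypothetical, heavily abbreviated label table (assumption: the real
// Encoding Standard lists many more labels per encoding).
let whatwgLabelToName: [String: String] = [
    "ascii": "windows-1252",
    "us-ascii": "windows-1252",
    "latin1": "windows-1252",
    "iso-8859-1": "windows-1252",
    "windows-1252": "windows-1252",
    "utf8": "UTF-8",
    "utf-8": "UTF-8",
    "sjis": "Shift_JIS",
    "shift_jis": "Shift_JIS",
    "windows-31j": "Shift_JIS",
]

/// Resolves a label to an encoding name, or nil if the label is unknown.
func whatwgEncodingName(forLabel label: String) -> String? {
    let asciiWhitespace: Set<Unicode.Scalar> = ["\t", "\n", "\u{0C}", "\r", " "]
    let uppercase: ClosedRange<Unicode.Scalar> = "A"..."Z"

    // Trim leading and trailing ASCII whitespace.
    var scalars = Array(label.unicodeScalars)
    while let first = scalars.first, asciiWhitespace.contains(first) { scalars.removeFirst() }
    while let last = scalars.last, asciiWhitespace.contains(last) { scalars.removeLast() }

    // ASCII-lowercase the remainder; non-ASCII scalars are left untouched.
    var key = ""
    for scalar in scalars {
        if uppercase.contains(scalar) {
            key.unicodeScalars.append(Unicode.Scalar(UInt8(scalar.value) + 0x20))
        } else {
            key.unicodeScalars.append(scalar)
        }
    }
    return whatwgLabelToName[key]
}

print(whatwgEncodingName(forLabel: " ASCII ") ?? "nil") // Prints "windows-1252"
```

Under this model "ascii" is only ever a label: a getter that returns names would never produce it, which is the asymmetry the question above is about.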


At least, in this pitch, we focus on currently available encodings.
Although it'll be hard for Foundation to reflect every change, we can catch up with the standard for a limited set of encodings at minor/major releases.
...That may be my wishful thinking?


I agree that Shift_JIS is required to be treated specifically.
As mentioned in the pitch, String.Encoding.shiftJIS is historically derived from kCFStringEncodingDOSJapanese in CF (not from kCFStringEncodingShiftJIS), which means that .shiftJIS is better treated as "Windows-31J" than as plain "Shift_JIS".
However, as you know, Windows-31J can be regarded as a variant of Shift_JIS in practice.


It's a tough question because String.init(data:encoding:) is now unfortunately broken. We may be able to consider the implementation when we fix it.


Big picture?

If we want strict correspondence between names and encoding/decoding implementations, we could get inspired by @Karl's suggestion:

However, ISO-2022-JP, for example, could never conform to Unicode.Encoding (right?).
Hence legacy encodings would need other protocols, such as:

public protocol StringEncodingProtocol {
  var name: String? { get }
  static func encoding(from name: String) -> Self?
  func encode<S>(_ string: S) throws -> Data? where S: StringProtocol
  func decode<D>(_ data: D) throws -> String? where D: DataProtocol
}

public struct IANACharset: StringEncodingProtocol {
  ...
}

public struct WHATWGEncoding: StringEncodingProtocol {
  ...
}


  1. WebKit/Source/WebCore/PAL/pal/text/TextEncoding.cpp at c57f1a4ede62f9ffef152bba257582c823fa9b6a · WebKit/WebKit · GitHub ↩︎

ByteStreamDecoder might be a better alternative than Unicode.Encoding because:

  • endianness and byte-order marks could be supported,
  • conformance wouldn't require both ForwardParser and ReverseParser,
  • name lookup could return an instance rather than a metatype.

(:thought_balloon:Ironically, it's the pitch which gave me the urge to throw the other tiny pitch.)

I agree ByteStreamDecoder would be better there. We might desire a protocol for encoders though.

public protocol StringEncodingProtocol {
  ...
  var decoder: (any Unicode.ByteStreamDecoder)? { get }
  var encoder: (any Unicode.ByteStreamEncoder)? { get }
}

Back to this pitch, how tightly should names and implementations be bound?

  • Turn a blind eye to concrete encoding implementations.
  • Roll back the pitch to support only IANA things.
  • Introduce unique protocol(s) to bind names and encoders/decoders.
  • Hold off until other proposals that string encoding name APIs would depend on, such as "Unicode Processing APIs", are accepted.

:thinking:

Let me share pitch #5 here.

With the WHATWG parts moved to Future Directions, the APIs are narrowed down to IANA-based conversions. On the other hand, we don't aim at full compatibility with CF APIs.
I think this is a realistic, down-to-earth approach for the time being.


String Encoding Names

  • Proposal: Not assigned yet
  • Author(s): YOCKOW
  • Review Manager: TBD
  • Status: Pitch

Revision History

Pitch#1

  • Features
    • Fully compatible with CoreFoundation.
      • Planned to add static properties corresponding to kCFStringEncoding*.
    • Spelling of getter/initializer was ianaCharacterSetName.
  • Pros
    • Easy to migrate from CoreFoundation.
  • Cons
    • Propagating undesirable legacy conversions into current Swift Foundation.
    • Including string encodings which might not be supported by Swift Foundation.

Pitch#2

  • Features
  • Pros
    • Easy to communicate with API.
  • Cons
    • Hard for users to comprehend conversions.
    • Difficult to maintain the API in a consistent way.

Pitch#3, Pitch#4

  • Features
    • Consulting both IANA Character Sets and WHATWG Encoding Standard.
    • Separated getters/initializers for them.
      • #3: charsetName and standardName respectively.
      • #4: name(.iana) and name(.whatwg) for getters; init(iana:) and init(whatwg:) for initializers.
  • Pros
    • Users can recognize what kind of conversions is used.
  • Cons
    • Not reflecting the fact that WHATWG's Encoding Standard provides not only string encoding names but also implementations to encode/decode data.

Pitch#5

This pitch.

Introduction

This proposal allows String.Encoding to be converted to and from various names.

For example:

print(String.Encoding.utf8.name!) // Prints "UTF-8"
print(String.Encoding(name: "ISO_646.irv:1991") == .ascii) // Prints "true"

Motivation

String encoding names are widely used in computer networking and other areas. For instance, you often see them in HTTP headers such as Content-Type: text/plain; charset=UTF-8 or in XML documents with declarations such as <?xml version="1.0" encoding="Shift_JIS"?>.

Therefore, it is necessary to parse and generate such names.

Current solution

Swift lacks the necessary APIs, requiring the use of CoreFoundation (hereinafter called "CF") as described below.

import CoreFoundation
import Foundation

extension String.Encoding {
  var nameInLegacyWay: String? {
    // 1. Convert `String.Encoding` value to the `CFStringEncoding` value.
    //    NOTE: The raw value of `String.Encoding` is the same as the value of `NSStringEncoding`,
    //          while it is not equal to the value of `CFStringEncoding`.
    let cfStrEncValue: CFStringEncoding = CFStringConvertNSStringEncodingToEncoding(self.rawValue)

    // 2. Convert it to the name where its type is `CFString?`
    let cfStrEncName: CFString? = CFStringConvertEncodingToIANACharSetName(cfStrEncValue)

    // 3. Convert `CFString` to Swift's `String`.
    //    NOTE: Unfortunately they cannot be implicitly cast on Linux.
    let charsetName: String? = cfStrEncName.flatMap {
      let bufferSize = CFStringGetMaximumSizeForEncoding(
        CFStringGetLength($0),
        kCFStringEncodingASCII
      ) + 1
      let buffer = UnsafeMutablePointer<CChar>.allocate(capacity: bufferSize)
      defer {
        buffer.deallocate()
      }
      guard CFStringGetCString($0, buffer, bufferSize, kCFStringEncodingASCII) else {
        return nil
      }
      return String(utf8String: buffer)
    }
    return charsetName
  }

  init?(fromNameInLegacyWay charsetName: String) {
    // 1. Convert `String` to `CFString`
    let cfStrEncName: CFString = charsetName.withCString { (cString: UnsafePointer<CChar>) -> CFString in
      return CFStringCreateWithCString(nil, cString, kCFStringEncodingASCII)
    }

    // 2. Convert it to `CFStringEncoding`
    let cfStrEncValue: CFStringEncoding = CFStringConvertIANACharSetNameToEncoding(cfStrEncName)

    // 3. Check whether or not it's valid
    guard cfStrEncValue != kCFStringEncodingInvalidId else {
      return nil
    }

    // 4. Convert `CFStringEncoding` value to `String.Encoding` value
    self.init(rawValue: CFStringConvertEncodingToNSStringEncoding(cfStrEncValue))
  }
}

What's the problem with the current solution?

  • It is complicated to use multiple CF functions to get a simple value. That's not Swifty.
  • CF functions are legacy APIs that do not always meet modern requirements.
  • CF APIs are not officially intended to be called directly from Swift on non-Darwin platforms.

Proposed solution

The solution is straightforward.
We introduce a computed property that returns the name, and an initializer that creates an instance from a name, as shown below.

extension String.Encoding {
  /// The name of this encoding, compatible with the IANA "charset" registry.
  public var name: String?

  /// Creates an instance from a name in the IANA "charset" registry.
  public init?(name: String)
}

Detailed design

This proposal refers to "Character Sets" published by IANA because CF APIs do so.
However, as mentioned above, CF APIs are sometimes out of step with the times.
Therefore, we need to adjust it to some extent:

[Figure: The graph of String.Encoding-Name conversions]

String.Encoding to Name

  • Upper-case letters may be used unlike CF.
    • var name returns Preferred MIME Name or Name of the encoding defined in "IANA Character Sets".

Name to String.Encoding

  • init(name:) adopts "Charset Alias Matching" defined in UTS#22.
    • e.g., "u.t.f-008" is recognized as "UTF-8".
  • init(name:) behaves consistently about ISO-8859-*.
    • For example, CF inconsistently handles "ISO-8859-1-Windows-3.1-Latin-1" and "csWindows31Latin1".
    • "ISO-8859-1-Windows-3.0-Latin-1" is a subset of "windows-1252", not of "ISO-8859-1".[1]
    • "ISO-8859-1-Windows-3.1-Latin-1" is a subset of "windows-1252", not of "ISO-8859-1".[2]
    • "ISO-8859-2-Windows-Latin-2" is a subset of "windows-1250", not of "ISO-8859-2".[3]
    • "ISO-8859-9-Windows-Latin-5" is a subset of "windows-1254", not of "ISO-8859-9".[4]
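
For reference, the "Charset Alias Matching" rule cited above can be sketched in a few lines. This assumes the three steps as I read them in Unicode Technical Standard #22, section 1.4: delete everything except ASCII letters and digits, lowercase the letters, then, from left to right, delete each 0 that is not preceded by a digit. `looseCharsetKey(_:)` is a hypothetical helper name, not part of the proposal, CF, or ICU, and real implementations may differ in edge cases.

```swift
/// Reduces a charset name to a comparison key; two names "match" when
/// their keys are equal. (Hypothetical helper for illustration only.)
func looseCharsetKey(_ name: String) -> String {
    var key = ""
    var previousKeptIsDigit = false
    for scalar in name.unicodeScalars {
        switch scalar {
        case "a"..."z":
            key.unicodeScalars.append(scalar)
            previousKeptIsDigit = false
        case "A"..."Z":
            // Map ASCII uppercase to lowercase.
            key.unicodeScalars.append(Unicode.Scalar(UInt8(scalar.value) + 0x20))
            previousKeptIsDigit = false
        case "0":
            // Keep a zero only when the previously kept character is a digit.
            if previousKeptIsDigit { key.unicodeScalars.append(scalar) }
        case "1"..."9":
            key.unicodeScalars.append(scalar)
            previousKeptIsDigit = true
        default:
            break // Everything else (punctuation, spaces, non-ASCII) is dropped.
        }
    }
    return key
}

print(looseCharsetKey("u.t.f-008")) // Prints "utf8"
print(looseCharsetKey("UTF-8"))     // Prints "utf8"
print(looseCharsetKey("utf-80"))    // Prints "utf80"
```

An initializer following this rule would compare keys rather than raw names, which is why "u.t.f-008" resolves to the same encoding as "UTF-8" while "utf-80" does not.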

Rationales for controversial points

  • While "ISO_646.irv:1983" (a.k.a. "Code page 1009") is resolved into .ascii by CF, it is, strictly speaking, incompatible with "US-ASCII".
    This proposal decides that String.Encoding can't be initialized from "ISO_646.irv:1983".
  • "CP51932" was regarded as a variant of "EUC-JP" formulated by Microsoft.
    It was, however, intended to be used mainly by web browsers (i.e. Internet Explorer considering the historical background) on Windows.
    As a result, it is incompatible with the original "EUC-JP" widely used on UNIX.
    Consequently, "CP51932" should not be associated with .japaneseEUC.
  • "CP932" is no longer available as a name for any encoding. Consequently, String.Encoding.shiftJIS.name returns "Shift_JIS".
  • "Windows-31J" is a variant of "Shift_JIS" extended by Microsoft.
    For historical reasons, String.Encoding.shiftJIS is an encoding equivalent to kCFStringEncodingDOSJapanese in CF (not to kCFStringEncodingShiftJIS), which means that .shiftJIS should be created from the name "Windows-31J" as well.

Source compatibility

The changes proposed here are purely additive. However, care must be taken when migrating from CF APIs.

Implications on adoption

This feature can be freely adopted and un-adopted in source code with no deployment constraints and without affecting source compatibility.

Future directions

String.init(data:encoding:) and String.data(using:) will be implemented more appropriately[5].

Hopefully, a cascade of changes like the following can be expected in the longer term.

  • General string decoders/encoders and their protocols (for example, as suggested in "Unicode Processing APIs") could be implemented.

  • Some types which provide their names and decoders/encoders could be implemented for the purpose of tightness between names and implementations.

    • There would be a type for WHATWG Encoding Standard which defines both names and implementations.
They would look like...
public protocol StrawmanStringEncodingProtocol {
  static func encoding(for name: String) -> Self?
  var name: String? { get }
  var encoder: (any StringToByteStreamEncoder)? { get }
  var decoder: (any ByteStreamToUnicodeScalarsDecoder)? { get }
}

public struct IANACharset: StrawmanStringEncodingProtocol {
  public static let utf8: IANACharset = ...
  public static let shiftJIS: IANACharset = ...
  :
  :
}

public struct WHATWGEncoding: StrawmanStringEncodingProtocol {
  public static let utf8: WHATWGEncoding = ...
  public static let eucJP: WHATWGEncoding = ...
  :
  :
}
  • String.Encoding might be deprecated as a natural course in the distant future??

Alternatives considered

Adopting the WHATWG Encoding Standard (as well)

There is another standard for string encodings which is published by WHATWG: "Encoding Standard".
While it may claim that IANA's Character Sets could be replaced with it, it focuses entirely on Web browsers and their JavaScript APIs.
Furthermore, it binds names tightly to implementations.
Since String.Encoding is just a RawRepresentable type whose RawValue is UInt, it is more universal but more loosely bound to implementations.
As a result, the WHATWG Encoding Standard doesn't easily align with String.Encoding, so it is only mentioned in "Future Directions".

Acknowledgments

Thanks to everyone who gave me advice on the pitch thread; especially to @benrimmington and @xwu, who channeled their concerns into this proposal at a very early stage.


  1. https://www.pclviewer.com/resources/symbolset/pcl_9u.pdf ↩︎

  2. https://www.pclviewer.com/resources/symbolset/pcl_19u_V2.pdf ↩︎

  3. https://www.pclviewer.com/resources/symbolset/pcl_9e.pdf ↩︎

  4. https://www.pclviewer.com/resources/symbolset/pcl_5t.pdf ↩︎

  5. ☂️ `String.init(data:encoding:)`/`String.data(using:)` regression. · Issue #1015 · swiftlang/swift-foundation · GitHub ↩︎

My suggestion is to be precise about what use case(s) these proposed APIs are aimed at serving: our discussion has made it clear that there isn't one solution which fits all the uses that fall under, as you describe, "computer networking and other areas," and I wouldn't say that "the solution is straightforward" to address all of them.

For example, the WHATWG Encoding Standard is specifically geared towards the Web—which is clearly a very, very large use case. Taking as given that you want to subset those use cases out, and that compatibility with currently written code that uses CF APIs is also out, for what audience are these new APIs tailored? For instance:

What format(s) adopt UTS#22 in the manner implemented here such that the proposed APIs are a perfect fit?

If HTTP headers and XML documents are not such examples, then they should not be cited as motivation. If the motivation is parsing HTTP headers and XML documents, then the APIs presented should have behavior such that they can be used without qualification to handle them.

@xwu Thank you for your responses every time. (To be clear, I'm not sarcastic.)

My wording was bad, but "we don't aim at full compatibility with CF APIs" meant just something like CF-compatibility-with-bugfixes. In short, it's CF-based fundamentally.

This is a clear example showing that the new API originates in CF (for better or worse):

With import CoreFoundation,

CFStringConvertIANACharSetNameToEncoding("u.t.f.@008" as! CFString) == CFStringBuiltInEncodings.UTF8.rawValue

is true.

That being said, as you pointed out, the XML specification, for example, requires only case-insensitive comparison. However, the CF approach can still parse XML's EncNames, since anything that matches under case-insensitive comparison also matches under the UTS#22 rule... well, it's just "not strict".

I understand your argument, but I feel that's a bit of a stretch.
As mentioned above, charset in HTTP headers and encoding in XML documents can be parsed with the UTS#22 matching rule.
Given that CFStringConvertIANACharSetNameToEncoding is used by quite a few projects (including swift-build), we can assume we already tolerate UTS#22 when parsing string encoding names.

If there's something we should discuss, I think it's whether or not we stay tolerant.
(It's easier to implement case-insensitive comparison than to implement the UTS#22 Charset Alias Matching rule, though...)
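
To illustrate how small the stricter alternative is, here is a minimal sketch of ASCII case-insensitive comparison, which is roughly what an XML EncName check needs. `asciiCaseInsensitiveEquals(_:_:)` is a hypothetical helper, not an existing Foundation API.

```swift
/// Compares two names byte by byte, folding only ASCII A-Z to a-z.
/// (Hypothetical helper for illustration only.)
func asciiCaseInsensitiveEquals(_ lhs: String, _ rhs: String) -> Bool {
    func fold(_ byte: UInt8) -> UInt8 {
        (0x41...0x5A).contains(byte) ? byte + 0x20 : byte
    }
    return lhs.utf8.elementsEqual(rhs.utf8) { fold($0) == fold($1) }
}

print(asciiCaseInsensitiveEquals("Shift_JIS", "shift_jis")) // Prints "true"
print(asciiCaseInsensitiveEquals("UTF-8", "u.t.f-008"))     // Prints "false"
```

Unlike the tolerant rule, punctuation and leading zeros stay significant here, so only exact spellings (modulo ASCII case) are accepted.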

Thanks--my point is that, with parsing, I think we need to be very concrete here about specifying behavior. This means:

CF "with bugfixes" and "CF-based fundamentally" is not CF, and Hyrum's law dictates that this will trip up somebody if the use case is to replace calls to CF. Therefore, if the proposed motivation is to offer a replacement API for uses that currently require CF, each of these "bugfixes" and "fundamentally"s will have to be detailed exactly, documented, and justified. This is hard because it's not sufficient that a new behavior be "better" in blue-sky development—you'd have to look at existing uses and whether they rely on the old "buggy" behavior, and measure impact to those clients.

Also, remember that the Encoding Standard is merely "fixing" inconsistencies that have crept in over the years with practical usage of the IANA standard. I fear that the result of "compatibility-with-bugfixes" is just a new, non-standard "encoding standard" recreated on-the-fly, and I don't think this is the role of Foundation or the Swift open source project: we should adopt existing standards.

As the Encoding Standard details, laxity--behaviors not specified in a standard and not expected in practice--can lead to security holes.

If you were writing a state-of-the-art reference implementation of an XML document parser, would you use the rules in UTS#22? By the sounds of it, no (but maybe I'm wrong?). If not, what rules would you use? A proposal that has XML document parsing as a central motivation would adopt those rules.

Here too is an example of where I think we need precision, not assumption—Tolerant in what specific use case(s) and domains? Is this behavior intended? specified? desired?

If we take as given that we want to make "bugfixes" in this area: should we offer APIs that don't apply UTS#22? Or ones that apply some other standard behavior?


Uh, you are touching a sore spot... You are definitely right.
Honestly speaking, I would personally never get upset even if the initializer ends up adopting only case-insensitive comparison. It would suffice for my use cases. (In fact, it was adopted in the first pitch.)
I was thinking that there might be someone who relies on UTS#22, considering that it does exist and is adopted by some implementations such as CFStringConvertIANACharSetNameToEncoding.


Before reconsidering the pitch, I'd like to think through the theoretical rationales for adopting UTS#22 and for adopting case-insensitive comparison, to organize my thoughts.
This is more of a monologue, which you don't need to read unless you have too much free time. :slight_smile:

💭What are the rationales?

◆ Rationale of adopting UTS#22

First, what we (I) learnt from the WHATWG Encoding Standard is that a type representing a string encoding can't stand alone.

If you get let myEncoding = MyFancyStringEncoding("EUC-JP"), that's not the end. You'll invariably want to do something like this: let string = myEncoding.decoder.decode(someData).
Namely, you need a decoder corresponding to that encoding.

By the way, String.Encoding doesn't provide its own decoders per se.
Where is the decoder?
It's String(bytes:encoding:) there.
String(bytes:encoding:) is bound to String.Encoding as its decoder.
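
A tiny illustration of that binding, using only existing Foundation API: the String.Encoding value is just an identifier, and the decoding work happens inside the initializer that receives it. The byte values are an arbitrary ASCII example chosen for this sketch.

```swift
import Foundation

// The bytes of "Swift" in US-ASCII.
let bytes: [UInt8] = [0x53, 0x77, 0x69, 0x66, 0x74]

// The encoding value carries no decoder of its own...
let encoding = String.Encoding.ascii

// ...the decoding is performed by the initializer it is passed to.
let decoded = String(bytes: bytes, encoding: encoding)
print(decoded ?? "decoding failed") // Prints "Swift"
```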

How does String(bytes:encoding:) decode a byte sequence?
It delegates decoding to ICU. [1]
That means String.Encoding is indirectly bound to ICU's decoder.

As a result, you can regard String.Encoding as a type which exposes ICU's internal encoding type.
If you think about it that way, it isn't strange but rather consistent that String.Encoding is instantiated in the same way as ICU's: that's UTS#22.
The String decoder already employs ICU, thus String.Encoding should follow suit.

This could answer this:

We have to adopt ICU's standard because it's Foundation's current de facto standard.
(Though I realize this doesn't address the concerns about name resolution itself.)


◆ Rationale of adopting case-insensitive comparison

In the real world, we can't find specifications that require UTS#22's charset alias matching rule (so far), while XML, for example, requires only case-insensitive comparison.
To avoid complexity and security risks, if any, we had better adopt a case-insensitive matching strategy.
That aligns somewhat with Foundation's current internal implementations:


  1. To be precise, some encodings (US-ASCII, UTF-*, ISO-8859-1, and Mac OS Roman) are specialized, so other encodings are delegated to ICU: CFStringCreateWithBytes → __CFStringCreateImmutableFunnel3 → __CFStringDecodeByteStream3 → CFStringEncodingGetConverter → __CFGetConverter → __CFStringEncodingConverterGetDefinition → __CFStringEncodingGetExternalConverter. The comment there says "// we prefer Text Encoding Converter ICU since it's more reliable". ↩︎

CFStringConvertIANACharSetNameToEncoding(_:) returns

  • kCFStringEncodingShiftJIS when given underscored "Shift_JIS" or "shift_jis" names. Probably using __CFKnownEncodingList and __CFCanonicalNameList.

  • kCFStringEncodingDOSJapanese when given hyphenated "Shift-JIS" or "shift-jis" names. Probably using ICU.

In your previous version, you had the initializer named

public init?(ianaName: String)

But you removed "iana" from the latest version. Wouldn't this make it ambiguous what "name" is supposed to refer to?

And if I understand it correctly, this is the reasoning?

This proposal refers to "Character Sets" published by IANA because CF APIs do so.

If so, I don't think new Swift API should be introduced to serve CF clients or to match CF history. Can you speak more to why you decided to drop "IANA" from the name?

I didn't notice...
"Shift_JIS" → kCFStringEncodingShiftJIS
"Shift__JIS" → kCFStringEncodingDOSJapanese

(And I'm not sure why __CFStringEncodingGetFromICUName calls ucnv_getStandardName for WINDOWS platform first instead of calling it for IANA...)


I'm sorry. That's surely a self-contradiction.
I should have made it explicit with the iana prefix.
I was (and am) wavering because we shouldn't take over CF behavior as is, while the WHATWG standard doesn't fit with Foundation's currently available APIs.

This time, the new pitch (Pitch#6) suggests the simplest approach that is still sufficient:

  • One-to-one correspondence between String.Encoding instances and names.
  • Make it explicit that IANA "charset" names are used.
  • Abandon UTS#22-based parsing and adopt case-insensitive comparison.
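
As a rough sketch of what these three points could amount to (not the pitch's actual implementation): a table in which each encoding has exactly one IANA name, looked up with ASCII case-insensitive comparison. The table contents and the free functions `ianaName(of:)` and `encoding(ianaName:)` are hypothetical and abbreviated.

```swift
import Foundation

// Hypothetical, abbreviated table: each encoding appears exactly once.
let ianaNameTable: [(encoding: String.Encoding, name: String)] = [
    (.ascii, "US-ASCII"),
    (.utf8, "UTF-8"),
    (.isoLatin1, "ISO-8859-1"),
    (.japaneseEUC, "EUC-JP"),
]

/// Folds only ASCII A-Z to a-z and returns the UTF-8 bytes.
func asciiFolded(_ string: String) -> [UInt8] {
    string.utf8.map { (0x41...0x5A).contains($0) ? $0 + 0x20 : $0 }
}

func ianaName(of encoding: String.Encoding) -> String? {
    ianaNameTable.first { $0.encoding == encoding }?.name
}

func encoding(ianaName: String) -> String.Encoding? {
    let key = asciiFolded(ianaName)
    return ianaNameTable.first { asciiFolded($0.name) == key }?.encoding
}

print(ianaName(of: .utf8) ?? "nil")       // Prints "UTF-8"
print(encoding(ianaName: "utf-8") == .utf8) // Prints "true"
print(encoding(ianaName: "utf8") == nil)    // Prints "true"
```

With a one-to-one table, a round trip from encoding to name and back always returns the original encoding, at the cost of rejecting aliases such as "utf8".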

Full text: