The full, latest version of this proposal is available as a gist: NNNN-String-Encoding-Names.md
IANA Character Set Names
- Proposal: Not assigned yet
- Author(s): YOCKOW
- Review Manager: TBD
- Status: Pitch
- Implementation: apple/swift-foundation#915
- Review: (Pitch)
Introduction
This proposal lets String.Encoding be converted to and from IANA character set names.
For example:
print(String.Encoding.utf8.ianaCharacterSetName!) // Prints "utf-8"
print(String.Encoding(ianaCharacterSetName: "ISO-10646-UCS-4")! == .utf32) // Prints "true"
Motivation
The string encoding names officially registered by IANA are widely used in computer networking and in other areas.
You will often find them in HTTP headers such as Content-Type: text/plain; charset=UTF-8 (where "UTF-8" is the name), and in XML documents such as <?xml version="1.0" encoding="Shift_JIS" ?> (where "Shift_JIS" is the name).
Consequently, code that sends or receives an HTTP response, for example, needs to generate and parse these "charset" names.
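To make the parsing side concrete, here is a minimal sketch of pulling the "charset" name out of a Content-Type header value. The helper name charsetName(fromContentType:) is hypothetical and not part of this proposal, and the sketch deliberately ignores quoted-string escapes and other corner cases of real header parsing.

```swift
import Foundation

/// Hypothetical helper (not part of the proposal): returns the value of the
/// `charset` parameter in a Content-Type header value, if there is one.
func charsetName(fromContentType value: String) -> String? {
    // Parameters follow the media type and are separated by ";".
    for parameter in value.split(separator: ";").dropFirst() {
        let pair = parameter.split(separator: "=", maxSplits: 1)
        guard pair.count == 2 else { continue }
        // Parameter names are compared case-insensitively.
        let key = pair[0].trimmingCharacters(in: .whitespaces).lowercased()
        guard key == "charset" else { continue }
        // Drop surrounding whitespace and optional double quotes.
        let name = pair[1].trimmingCharacters(in: CharacterSet(charactersIn: " \t\""))
        return name.isEmpty ? nil : name
    }
    return nil
}
```

The returned string is exactly the kind of name that would then have to be mapped to a String.Encoding value.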
Current solution
Swift lacks such APIs, so we have to use the functions defined in CoreFoundation as described below.
String.Encoding → Name
let encoding: String.Encoding = ...
// 1. Convert the `String.Encoding` value to a `CFStringEncoding` value.
// NOTE: The raw value of `String.Encoding` is the same as the value of `NSStringEncoding`,
// but it is not equal to the value of `CFStringEncoding`.
let cfStrEncValue: CFStringEncoding = CFStringConvertNSStringEncodingToEncoding(encoding.rawValue)
// 2. Convert it to the name, whose type is `CFString?`.
let cfStrEncName: CFString? = CFStringConvertEncodingToIANACharSetName(cfStrEncValue)
// 3. Convert the `CFString` to Swift's `String`.
// NOTE: Unfortunately, they cannot be implicitly cast on Linux.
let charsetName: String? = cfStrEncName.flatMap {
    let bufferSize = CFStringGetMaximumSizeForEncoding(
        CFStringGetLength($0),
        134217984 // kCFStringEncodingUTF8 (0x08000100)
    ) + 1
    let buffer = UnsafeMutablePointer<CChar>.allocate(capacity: bufferSize)
    defer {
        buffer.deallocate()
    }
    // 134217984 is kCFStringEncodingUTF8 again.
    guard CFStringGetCString($0, buffer, bufferSize, 134217984) else {
        return nil
    }
    return String(utf8String: buffer)
}
Name → String.Encoding
let charsetName: String = ...
// 1. Convert the `String` to a `CFString`.
let cfStrEncName: CFString = charsetName.withCString { (cString: UnsafePointer<CChar>) -> CFString in
    // 134217984 is kCFStringEncodingUTF8 (0x08000100).
    return CFStringCreateWithCString(nil, cString, 134217984)
}
// 2. Convert it to a `CFStringEncoding` value.
let cfStrEncValue: CFStringEncoding = CFStringConvertIANACharSetNameToEncoding(cfStrEncName)
// 3. Check whether it is valid.
guard cfStrEncValue != kCFStringEncodingInvalidId else {
    fatalError("Invalid Name.")
}
// 4. Convert the `CFStringEncoding` value to a `String.Encoding` value.
let encoding = String.Encoding(rawValue: CFStringConvertEncodingToNSStringEncoding(cfStrEncValue))
What's the problem with the current solution?
- Calling several CF functions just to get a simple value is complicated. That's not Swifty.
- CoreFoundation APIs are officially unavailable from Swift on non-Darwin platforms.
Proposed solution
The solution is simple.
We introduce a computed property that returns the IANA character set name and an initializer that creates an instance from a name, as shown below.
extension String.Encoding {
    public var ianaCharacterSetName: String? { get }
    public init?(ianaCharacterSetName: String)
}
Detailed design
CF Compatibility
Because CF functions such as CFStringConvertEncodingToIANACharSetName are widely used, this proposal maintains compatibility with them.
For example, String.Encoding.utf8.ianaCharacterSetName returns the lowercased "utf-8" rather than "UTF-8", because CFStringConvertEncodingToIANACharSetName(0x08000100 /* kCFStringEncodingUTF8 */) likewise returns "utf-8".
Value to Name
The following table shows which name each value is converted to.
var ianaCharacterSetName: String? { get } follows this rule.
[TABLE OMITTED: See gist]
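As a rough illustration of the value-to-name direction, the sketch below uses a hypothetical free function, sketchName(of:), as a stand-in for the proposed property. Only the .utf8 row is taken from this proposal's text; the remaining rows would come from the table in the gist and are intentionally left out here.

```swift
import Foundation

/// Hypothetical stand-in for the proposed `ianaCharacterSetName` property.
func sketchName(of encoding: String.Encoding) -> String? {
    switch encoding {
    case .utf8:
        return "utf-8"   // lowercased, for compatibility with the CF function
    default:
        return nil       // this sketch knows no other rows of the table
    }
}

/// Hypothetical helper: builds a Content-Type value, omitting the charset
/// parameter when the encoding has no name.
func sketchContentType(for encoding: String.Encoding) -> String {
    guard let name = sketchName(of: encoding) else { return "text/plain" }
    return "text/plain; charset=\(name)"
}
```

Returning an optional lets the caller decide what to do when an encoding has no corresponding name, as sketchContentType(for:) does by dropping the parameter.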
Name to Value
The following table shows which string encoding value each name is converted to.
init?(ianaCharacterSetName:) follows this rule case-insensitively.
[TABLE OMITTED: See gist]
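One way to picture case-insensitive matching is a lookup table keyed by lowercased names. The sketch below is only an assumption about how such a lookup could work, not the actual implementation. The function sketchEncoding(named:) is hypothetical, and the table is a small illustrative subset: the "utf-8" and "ISO-10646-UCS-4" pairs appear in this proposal's introduction, while the other pairs are assumed for illustration.

```swift
import Foundation

// Illustrative subset only; the complete table is in the gist.
// Keys are stored lowercased so that lookups can ignore case.
let sketchEncodingsByName: [String: String.Encoding] = [
    "utf-8": .utf8,
    "iso-10646-ucs-4": .utf32,   // pair shown in the introduction
    "us-ascii": .ascii,          // assumed pair
    "iso-8859-1": .isoLatin1,    // assumed pair
    "shift_jis": .shiftJIS,      // assumed pair
]

/// Hypothetical stand-in for `init?(ianaCharacterSetName:)`.
func sketchEncoding(named name: String) -> String.Encoding? {
    // Normalize the input the same way the keys were normalized.
    return sketchEncodingsByName[name.lowercased()]
}
```

With this shape, "UTF-8", "utf-8", and "Utf-8" all resolve to the same value, and an unknown name yields nil rather than trapping.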
Static properties
String.Encoding lacks some static properties that correspond to kCFStringEncoding* constants.
To make the String.Encoding values created by init?(ianaCharacterSetName:) meaningful, we add the following static properties to String.Encoding:
[CODE OMITTED: See gist]
Source compatibility
The changes proposed here are purely additive and are designed to maintain compatibility with the CF functions.
Implications on adoption
This feature can be freely adopted and un-adopted in source code with no deployment constraints and without affecting source compatibility.
Future directions
One possibility is for String.init(data:encoding:) and string.data(using:) to natively support more of the string encodings added here.
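For reference, these two conversions already work for encodings that Foundation supports today. The following minimal sketch uses only existing API with ISO Latin 1 and UTF-8; it illustrates the round trip that this future direction would extend to more encodings and is not a preview of any new behavior.

```swift
import Foundation

// "café" encoded as ISO-8859-1: 0xE9 is "é".
let latin1Bytes = Data([0x63, 0x61, 0x66, 0xE9])

// Decode with an explicit encoding; the result is nil if decoding fails.
let decoded = String(data: latin1Bytes, encoding: .isoLatin1)

// Re-encode the same text as UTF-8, where "é" becomes 0xC3 0xA9.
let reencoded = decoded?.data(using: .utf8)
```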
Alternatives considered
Elide the iana prefix
We could elide the iana prefix from the computed property and from the initializer's label.
However, "Character Set" has historically had a different meaning in Foundation. That is why we should add the iana prefix to CharacterSetName.