What is the default encoding of String(contentsOf:)?

PatrickA · July 13, 2020, 7:36am

I'm trying to find documentation which specifies the default encoding of String(contentsOf:)

The documentation I found is very laconic... ("No overview available.")

https://developer.apple.com/documentation/swift/string/3126735-init

Also what does nonLossyASCII really mean?

Again this page is very terse...

https://developer.apple.com/documentation/swift/string/encoding/1780349-nonlossyascii

Is there a better (official) source of documentation?

eskimo · July 13, 2020, 8:43am

The documentation I found is very laconic... ("No overview
available.")

You should definitely file bugs wherever you find the docs lacking. Please post your bug numbers, just for the record.

What is the default encoding of String(contentsOf:)?

It’s kinda complex. On Apple platforms this defers to -[NSString initWithContentsOfURL:usedEncoding:error:]. While that is better documented, it still won’t give you the concrete answers you’re looking for. That’s because the implementation is allowed to change to adjust for circumstances. The semantics of this initialiser are that it will take some system-specific steps to try to infer the encoding and then tell you what it inferred. Right now these include:

Looking at HTTP headers, for HTTP URLs
Looking at the com.apple.TextEncoding extended attribute
Sniffing the first few bytes of the file

but that could change.

Honestly, I avoid any NSString methods that don’t take encodings. An important gotcha here is that defaultCStringEncoding is not UTF-8 but MacRoman!
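
To make the difference concrete, here is a minimal sketch that reads the same file once with an explicit encoding and once with the inferring initialiser. The scratch file name is hypothetical and only there for illustration, and the inferred encoding is deliberately not shown as expected output because, as noted above, the inference steps can change.

```swift
import Foundation

// Hypothetical scratch file, used only for this illustration.
let url = FileManager.default.temporaryDirectory
    .appendingPathComponent("encoding-demo.txt")
let original = "Let’s not be naïve!"

var explicit: String?
var sniffed: String?
var inferred = String.Encoding.ascii   // placeholder, overwritten on success

do {
    try original.write(to: url, atomically: true, encoding: .utf8)

    // Explicit encoding: no guessing involved.
    explicit = try String(contentsOf: url, encoding: .utf8)

    // Inferred encoding: the system decides and reports back via `inferred`.
    sniffed = try String(contentsOf: url, usedEncoding: &inferred)
    print("inferred encoding:", inferred)
} catch {
    print("I/O or decoding failed: \(error)")
}

print(explicit == original)
```

If you know the encoding, the explicit form is the predictable one. The usedEncoding: form is mostly useful when you want to know what the system guessed.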

Also what does nonLossyASCII really mean?

This is equivalent to NSNonLossyASCIIStringEncoding. Admittedly, the documentation for that is rather sparse, but it’s easy to understand with an example:

let s1 = "Let’s not be naïve!"
let d = s1.data(using: .nonLossyASCII)!
let s2 = String(bytes: d, encoding: .ascii)!
print(s2) // Let\u2019s not be na\357ve!
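
Going the other way, a small sketch of the round trip: decoding the escaped bytes with the same encoding should give back the original string, which is the sense in which the encoding is non-lossy. The variable names here are mine, not from any API.

```swift
import Foundation

let source = "Let’s not be naïve!"

// Encode to pure ASCII bytes with escapes for the non-ASCII characters.
let encoded = source.data(using: .nonLossyASCII)!

// Decode the escapes again using the same encoding.
let decoded = String(data: encoded, encoding: .nonLossyASCII)

print(encoded.allSatisfy { $0 < 128 })   // every byte is 7-bit ASCII
print(decoded == source)                 // the round trip preserves the text
```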

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

mayoff · July 13, 2020, 1:24pm

You can find an implementation of non-lossy ASCII encoding here:

__CFStringEncodeByteStream

and an implementation of decoding here:

__CFStringDecodeByteStream3

PatrickA · July 13, 2020, 2:59pm

Thank you all for your helpful replies.

PatrickA · September 4, 2020, 4:11pm

I notice that the implementation of non-lossy encoding you point to allows for a uint8_t lossByte parameter. Characters that cannot be converted to the specified encoding are represented with the char specified by lossByte.

Is this lossByte character specifiable in any way when converting data in Swift?

In general, is there an encoder/decoder fallback mechanism in Swift, as there is in .NET for instance, to specify what to do with an input character that cannot be converted to the output encoding?

mayoff · September 4, 2020, 4:53pm

It's more precise to say that the function containing the implementation of non-lossy encoding takes a lossByte argument. But that function implements several different encodings. Some of the encodings use lossByte, but non-lossy encoding does not. (That is what makes it non-lossy!)

If you want a lossy encoding, and you want to pass lossByte, you can use CFStringGetBytes. Example:

import Foundation

let string = "Hello, 🌎!"
let cfString = string as CFString

// Compute the buffer size needed.
var bytesNeeded: CFIndex = 0
CFStringGetBytes(
    cfString,
    CFRangeMake(0, CFStringGetLength(cfString)),
    CFStringBuiltInEncodings.ASCII.rawValue,
    64 /* ASCII @ */,
    false,
    nil,
    0,
    &bytesNeeded
)

var buffer = [UInt8](repeating: 0, count: bytesNeeded)
// Fill the buffer.
buffer.withUnsafeMutableBufferPointer { buffer in
    CFStringGetBytes(
        cfString,
        CFRangeMake(0, CFStringGetLength(cfString)),
        CFStringBuiltInEncodings.ASCII.rawValue,
        64 /* ASCII @ */,
        false,
        buffer.baseAddress,
        buffer.count,
        nil
    )
}

print(buffer)

Output:

[72, 101, 108, 108, 111, 44, 32, 64, 64, 33]
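
For comparison, a minimal sketch of the higher-level Foundation route, data(using:allowLossyConversion:). It permits a lossy conversion but, as far as I know, gives you no way to choose the replacement byte, so CFStringGetBytes is still the one to use when lossByte matters. Which bytes get substituted is up to the implementation, so no output is shown for the lossy case.

```swift
import Foundation

let text = "Hello, 🌎!"

// Strict conversion: fails because the emoji has no ASCII representation.
let strict = text.data(using: .ascii)

// Lossy conversion: succeeds, with the system choosing the substitutes.
let lossy = text.data(using: .ascii, allowLossyConversion: true)

print(strict == nil)
if let lossy = lossy {
    print(Array(lossy))
}
```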

PatrickA · September 4, 2020, 6:11pm

Thank you Rob for your answer with the code snippet. Most useful.

I gather there is no mechanism available in Swift where one may supply a custom conversion fallback mechanism to say for instance that

"©" should be replaced by "c" or, if strings are allowed, "(c) ",
"®" by "r" or "(r)" ,
etc.

eskimo · September 6, 2020, 9:46pm

I gather there is no mechanism available in Swift where one may supply
a custom conversion fallback mechanism

Not directly, but you could work in Unicode space and apply the Any-Publishing string transform. For example:

import Foundation

let sss = "Hello ©®uel World!"
print(sss.applyingTransform(StringTransform("Any-Publishing"), reverse: true)!)
// prints: Hello (C)(R)uel World!

For more info on Unicode transforms, see the Transforms section of the ICU docs.
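
If you need full control over the replacements, one option is to do the substitution yourself before encoding. The following is only a sketch under my own assumptions: asciiWithFallbacks, the fallbacks table and the lossCharacter default are hypothetical names made up for this example, not Foundation API.

```swift
import Foundation

// Hypothetical fallback table: characters mapped to ASCII replacements.
let fallbacks: [Character: String] = [
    "©": "(c)",
    "®": "(r)",
]

// Hypothetical helper: keeps ASCII characters, substitutes known
// non-ASCII characters from the table, and uses `lossCharacter`
// for everything else.
func asciiWithFallbacks(
    _ input: String,
    fallbacks: [Character: String],
    lossCharacter: Character = "?"
) -> String {
    var output = ""
    for character in input {
        if character.isASCII {
            output.append(character)
        } else if let replacement = fallbacks[character] {
            output += replacement
        } else {
            output.append(lossCharacter)
        }
    }
    return output
}

let converted = asciiWithFallbacks("Hello ©®uel World!", fallbacks: fallbacks)
print(converted) // prints: Hello (c)(r)uel World!

// The result is pure ASCII, so a strict conversion now succeeds.
let bytes = converted.data(using: .ascii)!
print(bytes.count)
```

This works per Character (grapheme cluster), so a multi-scalar emoji becomes a single loss character rather than one per code unit.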

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

PatrickA · September 21, 2020, 5:36pm

Thank you Eskimo. Very interesting, learning a lot.

What is the relationship between "Unicode space" and ICU? What part of ICU is available?

eskimo · September 22, 2020, 8:51am

What is the relationship between "Unicode space" and ICU?

In this context I was using “Unicode space” to mean “convert the string to Unicode and then deal with the fallback transformation there”. The alternative, trying to do the encoding conversion and the fallback transformation in one step, is trickier.

What part of ICU is available?

Apple platforms support all ICU transforms supported by the version of ICU that ships with the platform [1].

IMPORTANT ICU itself is not an API on Apple platforms. Then again, it’s not entirely an implementation detail either, as witnessed by this string transform stuff.
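
As a rough illustration of the two ways in, here is a sketch that uses one of Foundation's predefined transforms and one raw ICU transform ID. "Latin-ASCII" is an ICU transform ID that I am assuming is present in the ICU version on your platform. The result is optional for exactly that reason, and no output is shown for it.

```swift
import Foundation

// Predefined Foundation constant: removes combining marks.
let stripped = "naïve".applyingTransform(.stripDiacritics, reverse: false)
print(stripped ?? "transform unavailable") // prints: naive

// Raw ICU transform ID (assumed to be available): returns nil if the
// platform's ICU does not recognise it.
let sample = "Let’s not be naïve!"
let latinASCII = sample.applyingTransform(
    StringTransform("Latin-ASCII"),
    reverse: false
)
print(latinASCII ?? "transform unavailable")
```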

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

[1] I’m pretty sure that we have a legacy doc that lists the ICU version for each iOS / macOS version but I wasn’t able to track it down )-: