The documentation I found is very laconic... ("No overview available.")
You should definitely file bugs wherever you find the docs lacking. Please post your bug numbers, just for the record.
What is the default encoding of String(contentsOf:)?
It’s kinda complex. On Apple platforms this defers to -[NSString initWithContentsOfURL:usedEncoding:error:]. While that is better documented, it still won’t give you the concrete answers you’re looking for. That’s because the implementation is allowed to change to adjust for circumstances. The semantics of this initialiser are that it will take some system-specific steps to try to infer the encoding and then tell you what it inferred. Right now these include:
Looking at HTTP headers, for HTTP URLs
Looking at the com.apple.TextEncoding extended attribute
Sniffing the first few bytes of the file
but that could change.
Honestly, I avoid any NSString methods that don’t take encodings. An important gotcha here is that defaultCStringEncoding is not UTF-8 but MacRoman!
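To make the difference concrete, here is a minimal sketch that reads the same file both ways. The scratch file name is made up for the demo, and I deliberately don’t show what the inferred encoding prints, because that is exactly the part that may vary by platform and OS version.

```swift
import Foundation

// Hypothetical scratch file, used only for this demonstration.
let demoURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("encoding-demo-\(UUID().uuidString).txt")
let demoText = "Let’s not be naïve!"

var explicitRead: String? = nil

do {
    try demoText.write(to: demoURL, atomically: true, encoding: .utf8)

    // Explicit encoding: the result does not depend on any inference.
    explicitRead = try String(contentsOf: demoURL, encoding: .utf8)
    print(explicitRead == demoText)

    // Inferred encoding: Foundation reports its guess via the inout parameter.
    var inferred = String.Encoding.ascii // placeholder, overwritten on success
    let guessed = try String(contentsOf: demoURL, usedEncoding: &inferred)
    print(inferred, guessed == demoText)
} catch {
    print("Read or write failed: \(error)")
}

// Clean up the scratch file.
try? FileManager.default.removeItem(at: demoURL)
```

If you control the file format, the explicit form is the one to rely on.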
Also what does nonLossyASCII really mean?
This is equivalent to NSNonLossyASCIIStringEncoding. Admittedly, the documentation for that is rather sparse, but it’s easy to understand with an example:
let s1 = "Let’s not be naïve!"
let d = s1.data(using: .nonLossyASCII)!
let s2 = String(bytes: d, encoding: .ascii)!
print(s2) // Let\u2019s not be na\357ve!
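The “non-lossy” part is easier to see if you go the other way. A small sketch, assuming you decode with the same encoding you encoded with: the escapes are turned back into the original characters.

```swift
import Foundation

let source = "Let’s not be naïve!"

// Encode to the escaped, pure-ASCII form.
let escaped = source.data(using: .nonLossyASCII)!

// Decode with the same encoding; the escapes become characters again.
let restored = String(data: escaped, encoding: .nonLossyASCII)

print(restored == source)
```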
I notice that the implementation of non-lossy encoding you point to allows for a uint8_t lossByte parameter. Characters that cannot be converted to the specified encoding are represented with the char specified by lossByte.
Is this lossByte character specifiable in any way when converting data in Swift?
In general, is there an encoder/decoder fallback mechanism in Swift as, for instance, in .NET to specify what to do with an input character which cannot be converted to the output encoding?
It's more precise to say that the function containing the implementation of non-lossy encoding takes a lossByte argument. But that function implements several different encodings. Some of the encodings use lossByte, but non-lossy encoding does not. (That is what makes it non-lossy!)
If you want a lossy encoding, and you want to pass lossByte, you can use CFStringGetBytes. Example:
import Foundation

let string = "Hello, 🌎!"
let cfString = string as CFString

// Compute the buffer size needed.
var bytesNeeded: CFIndex = 0
CFStringGetBytes(
    cfString,
    CFRangeMake(0, CFStringGetLength(cfString)),
    CFStringBuiltInEncodings.ASCII.rawValue,
    64 /* ASCII @ */,
    false,
    nil,
    0,
    &bytesNeeded
)
var buffer = [UInt8](repeating: 0, count: bytesNeeded)

// Fill the buffer.
buffer.withUnsafeMutableBufferPointer { buffer in
    CFStringGetBytes(
        cfString,
        CFRangeMake(0, CFStringGetLength(cfString)),
        CFStringBuiltInEncodings.ASCII.rawValue,
        64 /* ASCII @ */,
        false,
        buffer.baseAddress,
        buffer.count,
        nil
    )
}
print(buffer)
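If you don’t need to choose the substitute byte yourself, Foundation’s data(using:allowLossyConversion:) is a simpler route. This is only a sketch for comparison: what gets substituted for the emoji is decided by Foundation, not by you, so I’m not showing the output.

```swift
import Foundation

let greeting = "Hello, 🌎!"

// Strict conversion: nil, because 🌎 has no ASCII representation.
let strict = greeting.data(using: .ascii)

// Lossy conversion: Foundation picks the replacement for the emoji.
let lossy = greeting.data(using: .ascii, allowLossyConversion: true)

print(strict == nil)
if let lossy = lossy {
    print(String(decoding: lossy, as: UTF8.self))
}
```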
What is the relationship between "Unicode space" and ICU?
In this context I was using “Unicode space” to mean “convert the string to Unicode and then deal with the fallback transformation there”. The alternative, trying to do the encoding conversion and the fallback transformation in one step, is trickier.
What part of ICU is available?
Apple platforms support all ICU transforms supported by the version of ICU that ships with the platform [1].
IMPORTANT ICU itself is not an API on Apple platforms. Then again, it’s not entirely an implementation detail either, as witnessed by this string transform stuff.
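Here is a rough sketch of the two-step approach: do the fallback in Unicode space with a string transform, then do a strict encoding conversion. The "Latin-ASCII" transform ID is my choice for illustration; whether a given ID is available depends on the ICU that ships with the OS, so treat the result as optional.

```swift
import Foundation

let input = "Let’s not be naïve!"

// "Latin-ASCII" is an ICU transform ID, chosen here as an assumption.
let toASCII = StringTransform(rawValue: "Latin-ASCII")

// Step 1: apply the fallback transformation to the Unicode string.
let folded = input.applyingTransform(toASCII, reverse: false)

// Step 2: a strict conversion, which returns nil if anything non-ASCII remains.
let asciiData = folded?.data(using: .ascii)

print(folded ?? "transform unavailable")
print(asciiData != nil)
```

Keeping the two steps separate means the strict conversion tells you when the transform did not cover everything.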
Share and Enjoy
Quinn “The Eskimo!” @ DTS @ Apple
[1] I’m pretty sure that we have a legacy doc that lists the ICU version for each iOS / macOS version but I wasn’t able to track it down )-: