Can this string.data(using: .utf8)! ever fail? Is it safe to use it for any user input?
Theoretically, yes, it can fail. In practice, I've never seen it happen :)
However, if you just want the bytes of a string, you can always do this: let myBytes = [UInt8](myString.utf8) (and then Data(myBytes) if you really need it as Data).
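A minimal sketch of that approach, assuming a made-up sample string (the names and contents are only for illustration):

```swift
import Foundation

// Sample input; "\u{E9}" is "é", which takes two bytes in UTF-8.
let myString = "h\u{E9}llo"

// The UTF-8 view is always available, so there is no optional to unwrap.
let myBytes = [UInt8](myString.utf8)

// Wrap the bytes in Data only if an API actually requires Data.
let myData = Data(myBytes)

print(myBytes.count) // 6
print(myData.count)  // 6
```

Neither step returns an optional, so no force unwrap is involved.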
It can fail, for example with this broken NSString:
import Foundation
let x = "😀" as NSString
// This creates a broken string: NSRange uses NSString's UTF-16 offsets, so a length of 1 splits the emoji's surrogate pair.
let bad = x.substring(with: NSRange(location: 0, length: 1)).data(using: .utf8)
print(bad.debugDescription) // will print 'nil'
If you want something that can't fail, use String.utf8, which yields a UTF8View that substitutes Unicode's replacement character (U+FFFD) for any ill-formed part of the string. If you want a Data containing the UTF-8 bytes of a String, you could use:
import Foundation
let x = "😀" as NSString
// This creates a broken string: NSRange uses NSString's UTF-16 offsets, so a length of 1 splits the emoji's surrogate pair.
let badString = x.substring(with: NSRange(location: 0, length: 1))
print(Data(badString.utf8)) // will print '3 bytes'
This can't fail. The 3 bytes you'll find in there are the UTF-8 encoding (EF BF BD) of the replacement character U+FFFD.
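To look at the replacement character without going through NSString, here is a small sketch that feeds a lone high surrogate to the standard library's repairing initializer String(decoding:as:). This is a different route to ill-formed input than the substring call above, and the variable names are illustrative.

```swift
// 0xD83D is the high half of a surrogate pair; on its own it is ill-formed UTF-16.
let loneSurrogate: [UInt16] = [0xD83D]

// String(decoding:as:) never fails: it repairs ill-formed input with U+FFFD.
let repaired = String(decoding: loneSurrogate, as: UTF16.self)

let bytes = Array(repaired.utf8)
print(bytes.map { String($0, radix: 16, uppercase: true) }) // ["EF", "BF", "BD"]
print(repaired == "\u{FFFD}") // true
```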
(I know this topic is old, but I think it should be updated.)
As of Swift 5 (5.1), string.data(using: .utf8) behaves the same as Data(string.utf8), since swift#24215, swift#24239, and swift-corelibs-foundation#2173 were merged.
Therefore, bad.debugDescription in @johannesweiss's first example above is no longer nil in recent Swift versions.
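For a well-formed string, both forms produce the same bytes. A small sketch that checks this, using an arbitrary sample string of my own choosing:

```swift
import Foundation

// Arbitrary sample; "\u{F6}" is "ö", two bytes in UTF-8.
let sample = "Hello, w\u{F6}rld"

// Optional result from the Foundation API.
let viaFoundation: Data? = sample.data(using: .utf8)

// Non-optional result built from the UTF-8 view.
let viaUTF8View = Data(sample.utf8)

print(viaFoundation == viaUTF8View) // true
```

The second form has the practical advantage that there is no optional to unwrap.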
From my experience, Swift's String behaves differently depending on where the string comes from.
If a string is a literal typed directly in the source code, it's guaranteed to be valid Unicode. In my experience, transforming it to UTF-X data will never fail:
let data: Data = "Hello my World! 🌍".data(using: .utf8)!
Where things can go wrong is after you manipulate the underlying bytes and accidentally damage the logical structure of an encoded Unicode character. UTF-8 has rules, UTF-16 has rules, and so do other encodings. If we violate them, we are outside the safe zone.
The only case where I succeeded in producing a broken string was by stepping outside Swift and using NSString, exactly as @johannesweiss demonstrated. I haven't managed to break a Swift String using its own APIs.
A related point is that String also protects itself against invalid initial data that comes from sources other than literals. Valid example:
import Foundation
let raw: [UInt8] = [0xF0, 0x9F, 0x98, 0x80] // UTF-8 for 😀
let s = String(bytes: raw, encoding: .utf8) // Optional("😀")
Invalid example:
let raw: [UInt8] = [0xF0] // lead byte only; the three continuation bytes are missing
let s: String = String(bytes: raw, encoding: .utf8)! // traps at runtime: the initializer returns nil
In this second example, the byte sequence starts a multi-byte UTF-8 character but does not include the required continuation bytes.
- 0xF0 is the start byte of a 4-byte UTF-8 sequence.
- After a 0xF0, UTF-8 requires three continuation bytes (each of the form 10xxxxxx).
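The lead-byte rule can be sketched in code. The helper expectedSequenceLength(forLeadByte:) below is hypothetical (it is not part of Swift or Foundation) and only inspects the first byte. It is followed by two ways to handle the truncated input without crashing.

```swift
import Foundation

// Hypothetical helper: how many bytes a UTF-8 sequence needs, judged by its first byte.
func expectedSequenceLength(forLeadByte byte: UInt8) -> Int? {
    switch byte {
    case 0x00...0x7F: return 1   // ASCII
    case 0xC2...0xDF: return 2
    case 0xE0...0xEF: return 3
    case 0xF0...0xF4: return 4
    default: return nil          // continuation byte (10xxxxxx) or invalid lead byte
    }
}

let truncated: [UInt8] = [0xF0]
print(expectedSequenceLength(forLeadByte: truncated[0]) ?? 0) // 4, but only 1 byte is present

// Failable: returns nil instead of crashing, so handle it with optional binding.
if let text = String(bytes: truncated, encoding: .utf8) {
    print("decoded:", text)
} else {
    print("not valid UTF-8")
}

// Non-failable alternative: ill-formed input is replaced with U+FFFD.
let repaired = String(decoding: truncated, as: UTF8.self)
print(repaired == "\u{FFFD}") // true
```

Which of the two to use depends on whether invalid input should be rejected or silently repaired.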
I find it important to mention that it's very easy to reach for Foundation's NSString APIs and mix them with the Swift String API whenever import Foundation is present in a file. Also, the fact that Swift's String (since Swift 5) is stored in memory as UTF-8 while NSString uses UTF-16 should be more widely known.
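The difference in code units is easy to observe. A small sketch, using NSString(string:) for the conversion so that it does not rely on as-bridging:

```swift
import Foundation

let s = "\u{1F600}" // 😀, outside the Basic Multilingual Plane

print(s.count)                    // 1 (one Character)
print(s.utf8.count)               // 4 (UTF-8 code units)
print(s.utf16.count)              // 2 (a surrogate pair)
print(NSString(string: s).length) // 2 (NSString counts UTF-16 code units)
```

This is why an NSRange of length 1 can land in the middle of a character that Swift treats as a single unit.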