Can encoding String to Data with utf8 fail?

Can this string.data(using: .utf8)! ever fail? Is it safe to use it for any user input?

Theoretically, yes. In reality, I've never encountered it yet :)

However, if you just want the bytes of a string, you can always do this: let myBytes = [UInt8](myString.utf8) (and then Data(myBytes) if you really need it as Data).
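A minimal sketch of that approach, using a made-up sample string (the value of myString and the name myData are only for illustration):

```swift
import Foundation

// Hypothetical sample input; any String works here.
let myString = "h\u{E9}llo"            // "héllo"

// The utf8 view is non-optional, so there is nothing to unwrap.
let myBytes = [UInt8](myString.utf8)
let myData = Data(myBytes)             // or directly: Data(myString.utf8)

print(myBytes)       // [104, 195, 169, 108, 108, 111]
print(myData.count)  // 6
```

The "é" takes two bytes in UTF-8, which is why five characters turn into six bytes.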


It can fail, for example with this broken NSString:

import Foundation

let x = "💇‍♀️" as NSString
// This creates a broken string: NSString ranges are UTF-16 offsets, and a length of 1 splits a surrogate pair in half, which is illegal.
let bad = x.substring(with: NSRange(location: 0, length: 1)).data(using: .utf8)
print(bad.debugDescription) // will print 'nil'

If you want something that can't fail, use String.utf8, which yields a UTF8View that substitutes Unicode's replacement character (U+FFFD) for any ill-formed part of the string. If you want a Data containing the UTF-8 bytes of a String, you could use

import Foundation

let x = "💇‍♀️" as NSString
// This creates a broken string: NSString ranges are UTF-16 offsets, and a length of 1 splits a surrogate pair in half, which is illegal.
let badString = x.substring(with: NSRange(location: 0, length: 1))
print(Data(badString.utf8)) // will print '3 bytes'

which can't fail. The 3 bytes you'll find in there are the UTF-8 encoding of this replacement character (EF BF BD).
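To see where those three bytes come from without going through NSString, here is a small sketch. The lone high surrogate fed to String(decoding:as:) is only a stand-in I'm assuming for the broken substring above, and all names are made up:

```swift
import Foundation

// U+FFFD REPLACEMENT CHARACTER takes three bytes in UTF-8.
let replacement = "\u{FFFD}"
let replacementBytes = Array(replacement.utf8)
print(replacementBytes.map { String($0, radix: 16, uppercase: true) })  // ["EF", "BF", "BD"]

// A lone high surrogate (half of a surrogate pair) is ill-formed UTF-16.
// String(decoding:as:) repairs it instead of failing.
let loneSurrogate: [UInt16] = [0xD83D]
let repaired = String(decoding: loneSurrogate, as: UTF16.self)
print(Data(repaired.utf8).count)  // 3
```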


(I know this topic is old, but think it should be updated.)

Since Swift 5 (5.1), string.data(using: .utf8) behaves exactly like Data(string.utf8), because swift#24215, swift#24239, and swift-corelibs-foundation#2173 were merged.

Therefore, bad.debugDescription in @johannesweiss's first example above no longer prints nil in recent Swift.


From my experience, Swift's String behaves differently depending on where the string comes from.

If a string is a literal typed directly in the source code, it's guaranteed to be valid Unicode. In my experience, encoding it as UTF-8, UTF-16, or UTF-32 data will never fail:

let data: Data = "Hello my World! 😀".data(using: .utf8)!

Where things can go wrong is after you manipulate the underlying bytes and accidentally damage the structure of an encoded Unicode character. UTF-8 has rules, UTF-16 has rules, and so do the other encodings. If we violate them, we are outside the safe zone.
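As a rough illustration of damaging the bytes, here is a sketch that chops the last byte off a valid UTF-8 buffer and then decodes it in two ways. The sample text and variable names are made up for this example:

```swift
import Foundation

let original = "Hello \u{1F600}"     // 6 ASCII bytes + 4 bytes for the emoji
var bytes = Array(original.utf8)
bytes.removeLast()                   // cut the emoji's 4-byte sequence short

// Failable Foundation initializer: rejects the ill-formed input.
let strict = String(bytes: bytes, encoding: .utf8)
print(strict == nil)                 // true

// Standard library initializer: never fails, repairs with U+FFFD instead.
let lossy = String(decoding: bytes, as: UTF8.self)
print(lossy.unicodeScalars.contains("\u{FFFD}"))  // true
```

Which of the two you want depends on whether silently replaced characters are acceptable for your input.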

The only case where I succeeded in producing a broken string was by stepping outside Swift and using NSString, exactly as @johannesweiss demonstrated. I haven't managed to break a Swift String using its own APIs.

A related point is that String also protects itself against invalid input data that doesn't come from a literal. Valid example:

let raw: [UInt8] = [0xF0, 0x9F, 0x98, 0x80] // UTF-8 for 😀
let s = String(bytes: raw, encoding: .utf8) // "😀"

Invalid example:

let raw: [UInt8] = [0xF0] // only the first byte of the UTF-8 encoding of 😀
let s: String = String(bytes: raw, encoding: .utf8)! // crashes: the initializer returns nil, so the force unwrap traps

In this second example, the byte sequence starts a multi-byte UTF-8 character but does not include the required continuation bytes.

  • 0xF0 is the start byte of a 4-byte UTF-8 sequence.
  • After a 0xF0, UTF-8 requires three continuation bytes (10xxxxxx)
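The two rules above can be sketched as a pair of small helpers. Both functions are hypothetical, written only for this illustration, and they check the lead byte and continuation pattern only, not every UTF-8 rule (overlong and surrogate encodings are not rejected here):

```swift
// Hypothetical helper: how many bytes a well-formed UTF-8 sequence has,
// judged from its lead byte alone.
func expectedUTF8Length(leadByte: UInt8) -> Int? {
    switch leadByte {
    case 0x00...0x7F: return 1    // 0xxxxxxx (ASCII)
    case 0xC2...0xDF: return 2    // 110xxxxx
    case 0xE0...0xEF: return 3    // 1110xxxx
    case 0xF0...0xF4: return 4    // 11110xxx
    default:          return nil  // continuation byte, or a byte that never starts a sequence
    }
}

// Hypothetical helper: continuation bytes have the bit pattern 10xxxxxx.
func isContinuationByte(_ byte: UInt8) -> Bool {
    return (byte & 0b1100_0000) == 0b1000_0000
}

let valid: [UInt8] = [0xF0, 0x9F, 0x98, 0x80]   // the bytes from the valid example
let truncated: [UInt8] = [0xF0]                 // the bytes from the invalid example

print(expectedUTF8Length(leadByte: valid[0]) == valid.count)          // true
print(valid.dropFirst().allSatisfy(isContinuationByte))               // true
print(expectedUTF8Length(leadByte: truncated[0]) == truncated.count)  // false
```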

I think it's important to mention that it's very easy to mix Foundation's NSString APIs with the Swift String API whenever import Foundation is present in a file. Also, the fact that a Swift String (since Swift 5) is stored in memory as UTF-8, while NSString uses UTF-16, should be more widely known.
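A quick way to see the difference between the two views of the same string (the variable name is made up; NSString(string:) is used instead of a bridging cast so the snippet does not depend on Objective-C bridging):

```swift
import Foundation

let emoji = "\u{1F600}"                   // 😀

print(emoji.count)                        // 1 (one Character)
print(emoji.utf8.count)                   // 4 (UTF-8 code units)
print(emoji.utf16.count)                  // 2 (a surrogate pair)
print(NSString(string: emoji).length)     // 2 (NSString counts UTF-16 code units)
```

An NSRange with length 1 therefore covers only half of this emoji, which is exactly how the broken string earlier in the thread was produced.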