Endianness when using String(decoding: collection, as: UTF16.self)


(Sergej Jaskiewicz) #1

I was wondering how the String(decoding:as:) initializer from the standard library (not Foundation) handles endianness.

I have the following code:

let bigEndian: [UInt8] = [0xFE, 0xFF, 0x00, 0x48, 0x00, 0x65, 0x00, 0x6C, 0x00, 0x6C, 0x00, 0x6F]
let littleEndian: [UInt8] = [0xFF, 0xFE, 0x48, 0x00, 0x65, 0x00, 0x6C, 0x00, 0x6C, 0x00, 0x6F, 0x00]

print(littleEndian.withUnsafeBytes { raw in
  String(
    decoding: raw.bindMemory(to: UInt16.self),
    as: UTF16.self
  )
}) // "Hello"

print(bigEndian.withUnsafeBytes { raw in
  String(
    decoding: raw.bindMemory(to: UInt16.self),
    as: UTF16.self
  )
}) // "䠀攀氀氀漀"

Is there any way to force that initializer to decode the bytes as big endian if my host system is little endian? I tried to explore the stdlib source code but had a hard time understanding all the bitwise trickery.

And yes, I know that I can use Foundation's String(data: Data(bigEndian), encoding: .utf16BigEndian), but I'm writing a Linux library and don't want to depend on Foundation.


(Michael Ilseman) #2

This would follow the platform endianness by default. FixedWidthInteger provides bigEndian and littleEndian, each of which does a byte swap only if that byte order differs from the platform endianness. Here's how to do the decoding independently of your platform endianness when you know the bytes are big endian.

// Eager
print(bigEndian.withUnsafeBytes { raw in
  String(
    decoding: raw.bindMemory(to: UInt16.self).map { $0.bigEndian },
    as: UTF16.self
  )
}) // "Hello"

// Lazy (skip intermediary Array)
print(bigEndian.withUnsafeBytes { raw in
  String(
    decoding: raw.bindMemory(to: UInt16.self).lazy.map { $0.bigEndian },
    as: UTF16.self
  )
}) // "Hello"

So you could add something like:

extension String {
  public init<T: FixedWidthInteger, Encoding: Unicode.Encoding>(
    decodingBigEndianBytes bytes: [T], as encoding: Encoding.Type
  ) {
    self = bytes.withUnsafeBytes { raw in
      String(
        decoding: raw.bindMemory(to: Encoding.CodeUnit.self).lazy.map { $0.bigEndian },
        as: encoding)
    }
  }
}
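
For comparison, here is a minimal sketch of the same idea without unsafe pointers, assuming the input is a plain [UInt8] holding UTF-16 in big-endian order. The name decodeUTF16BigEndian is hypothetical and not part of the standard library. Each code unit is assembled from its two bytes explicitly, so the result does not depend on the host's byte order.

```swift
// Hypothetical helper, not a standard library API.
// Assumes `bytes` holds UTF-16 code units in big-endian byte order.
func decodeUTF16BigEndian(_ bytes: [UInt8]) -> String {
  // Combine each pair of bytes: high byte first, low byte second.
  // A trailing odd byte, if any, is ignored.
  let units: [UInt16] = stride(from: 0, to: bytes.count - 1, by: 2).map { i in
    UInt16(bytes[i]) << 8 | UInt16(bytes[i + 1])
  }
  // Invalid sequences are repaired with U+FFFD, as with any String(decoding:as:) call.
  return String(decoding: units, as: UTF16.self)
}

let sample: [UInt8] = [0x00, 0x48, 0x00, 0x65, 0x00, 0x6C, 0x00, 0x6C, 0x00, 0x6F]
print(decodeUTF16BigEndian(sample)) // "Hello"
```

Note that neither this sketch nor String(decoding:as:) strips a byte order mark: a leading 0xFE 0xFF is decoded as the scalar U+FEFF and stays in the resulting string, so callers may want to drop it first.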

(Sergej Jaskiewicz) #3

Thank you!

Now I'm wondering if this could (should?) belong to the standard library.

Or at least it could be better documented, to avoid the confusion I found myself in :)


(Michael Ilseman) #4

To put the initializer in the standard library, I think we'd want to generalize it from Array to any Collection of FixedWidthIntegers. Using withUnsafeBytes can be an implementation optimization when supported (we already do similar things for the normal decoding init, but that has the Element constrained to CodeUnit).

String has a lot of initializers and could use some spring cleaning in Swift 5.x, so this might be possible as part of an overhaul. Could you file a bug and/or make an SE pitch?
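
One possible reading of that generalization, as a rough sketch only: the initializer name and label below are hypothetical, and this version assumes the collection's elements are already code units whose stored bit pattern is big endian (for example, loaded straight from big-endian bytes). It omits the withUnsafeBytes fast path mentioned above.

```swift
extension String {
  // Hypothetical initializer; the name and label are illustrative, not a proposed API.
  // Assumes each element holds a code unit stored in big-endian byte order.
  init<C: Collection, E: Unicode.Encoding>(
    decodingBigEndian codeUnits: C, as encoding: E.Type
  ) where C.Element == E.CodeUnit {
    // `bigEndian` swaps bytes only when the host is little endian.
    self.init(decoding: codeUnits.lazy.map { $0.bigEndian }, as: encoding)
  }
}

// Build code units whose in-memory layout is big endian on any host.
let stored: [UInt16] = ([0x0048, 0x0069] as [UInt16]).map { $0.bigEndian }
print(String(decodingBigEndian: stored, as: UTF16.self)) // "Hi"
```

Because the constraint is on Collection rather than Array, the same call would also accept an ArraySlice or an UnsafeBufferPointer of code units.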


(Sergej Jaskiewicz) #5

Okay! Do you think it's better done now or later, since I guess the team has more important things to do as the Swift 5.0 release is getting closer?


(Michael Ilseman) #6

This will not make Swift 5.0, but could be part of a future release. The main challenge will be deciding how this API should look and gel with the other String initializers.