How to convert [UInt8] into [UInt16]?

I have a big array (like several GB)

let bigArray: ContiguousArray<UInt8> = ...

bigArray.count is even.

Now I want to access (read only) the content of bigArray as UInt16.

like:

let big16Array = bigArray as ContiguousArray<UInt16>
for uint16 in big16Array {sum += someFunc(uint16)}

This of course does not compile.

How can I do this?

Also I do not want the bigArray to be copied.

You can dance with unsafe pointers. Assuming you're certain about the endianness:

// input
let array: [UInt8] = [1, 2, 3, 4]

// the operation
func doSomethingWithUInt16s(_ collection: some RandomAccessCollection<UInt16>) {
    print(collection.count)
}

// the dance
array.withUnsafeBufferPointer { buffer in
    buffer.withMemoryRebound(to: UInt16.self) { buffer in
        doSomethingWithUInt16s(buffer)
    }
}

Note that the code does not convert [UInt8] into a [UInt16] as you've requested in the title, because that will always create a copy, which is what you essentially want to avoid. So it uses a RandomAccessCollection instead.

If you're willing to write a touch more code, you can do this and avoid the unsafe code entirely:

struct ProjectingToUInt16Collection<Base>: RandomAccessCollection where Base: RandomAccessCollection, Base.Element == UInt8 {
    private var base: Base

    fileprivate init(wrapping base: Base) {
        precondition(base.count % 2 == 0)
        self.base = base
    }

    var startIndex: Base.Index {
        self.base.startIndex
    }

    var endIndex: Base.Index {
        self.base.endIndex
    }

    func index(after original: Base.Index) -> Base.Index {
        self.base.index(original, offsetBy: 2)
    }

    func index(before original: Base.Index) -> Base.Index {
        self.base.index(original, offsetBy: -2)
    }

    func index(_ original: Base.Index, offsetBy distance: Int) -> Base.Index {
        return self.base.index(original, offsetBy: distance * 2)
    }

    func distance(from start: Base.Index, to end: Base.Index) -> Int {
        return self.base.distance(from: start, to: end) / 2
    }

    subscript(index: Base.Index) -> UInt16 {
        let lowNibble = self.base.index(after: index)
        return UInt16(self.base[index]) << 8 | UInt16(self.base[lowNibble])
    }
}

extension RandomAccessCollection where Element == UInt8 {
    func projectingToUInt16() -> ProjectingToUInt16Collection<Self> {
        .init(wrapping: self)
    }
}

This has the advantage of being fully generic, and the disadvantage of being a little more code. With slightly more work this can be used to project a collection of any fixed-width integer to any other fixed-width integer, though the subscript operator gets a little harder.


I think this should be:

subscript(index: Base.Index) -> UInt16 {
  let lowByte = base.index(after: index)
  return UInt16(base[index]) << 8 | UInt16(base[lowByte])
}

…or possibly that should be highByte and the order swapped, depending on OP’s intended endianness.

And if desired you could also throw in a subscript setter when the base is mutable.


Good catch, yes, the shift should be by 8 not 16. I'll edit the above to fix that.

Yeah, there are a lot of enhancements to this to make it fully general: conditional MutableCollection conformance, etc.

For the pointer code: technically you should not be assuming that the storage for an array of UInt8 is properly aligned for an array of UInt16. But it will be in practice with Swift’s Array for the foreseeable future. Still, don’t assume this will work for arbitrary buffers, like Data.
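
For buffers whose alignment you cannot vouch for, one option is to read through a raw buffer with loadUnaligned, which the standard library has offered since Swift 5.7. The following is only a sketch: the function name sumBigEndianUInt16s is made up for illustration, and it assumes the bytes are stored big-endian.

```swift
// Hypothetical helper: sums the bytes interpreted as big-endian UInt16 values.
// loadUnaligned does not require the address to be UInt16-aligned.
func sumBigEndianUInt16s(_ bytes: [UInt8]) -> UInt64 {
    precondition(bytes.count.isMultiple(of: 2), "byte count must be even")
    return bytes.withUnsafeBytes { raw in
        var sum: UInt64 = 0
        for offset in stride(from: 0, to: raw.count, by: 2) {
            // Read two bytes in the machine's native byte order...
            let native = raw.loadUnaligned(fromByteOffset: offset, as: UInt16.self)
            // ...then state explicitly that the stored representation was big-endian.
            sum += UInt64(UInt16(bigEndian: native))
        }
        return sum
    }
}
```

No copy of the array is made; the closure only reads from the existing storage.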

import Algorithms

array.chunks(ofCount: 2).lazy.map { 
    UInt16($0[$0.startIndex]) << 8 | UInt16($0[$0.index(after: $0.startIndex)]) 
}

(Big-endian in this case.)
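
If you would rather not pull in the Algorithms package, roughly the same lazy view can be sketched with the standard library alone. This assumes an even byte count and big-endian order, and the function name lazyBigEndianUInt16s is made up for illustration:

```swift
// Hypothetical helper: a lazy, non-copying sequence of big-endian UInt16 values.
// Nothing is computed until the sequence is iterated.
func lazyBigEndianUInt16s(_ bytes: [UInt8]) -> LazyMapSequence<StrideTo<Int>, UInt16> {
    precondition(bytes.count.isMultiple(of: 2), "byte count must be even")
    return stride(from: 0, to: bytes.count, by: 2).lazy.map { offset in
        // High byte first, then low byte (big-endian).
        UInt16(bytes[offset]) << 8 | UInt16(bytes[offset + 1])
    }
}
```

Capturing the array in the closure does not copy its storage, because Swift arrays are copy-on-write.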

Oh, I have lore about this exact problem…

Anyway, there's a pretty major tradeoff here between performance and safety. I suspect a sequence-based approach like @lukasa's might be notably slower than one that does a pointer dance (which is generally free after optimizations). However, if the input array is misaligned, you'll crash with a recast pointer on some architectures.

@CrystDragon, @lukasa: Tried your methods, but when indexed they seem to yield different results.

[1, 2, 3, 4]
2 UnsafeBufferPointer(start: 0x0000600002de85f0, count: 2)
first last: 513 1027
first last: 513 1027

[1, 2, 3, 4]
2 ProjectingToUInt16Collection<Array<UInt8>>(base: [1, 2, 3, 4])
first last: 258 772
first last: 258 772
Program ended with exit code: 0

Am I doing something wrong?

Driver
//  Driver.swift
//  UInt8ArrayToUInt16Array
//
//  Created by ibex on 19/3/2025.
//

// [https://forums.swift.org/t/how-to-convert-uint8-into-uint16/78624]

import Foundation

@main
enum Driver {
    static func main () {
        do {
            let u: [UInt8] = [1, 2, 3, 4]
            RebindMemory.test (array: u)
            print ()
            Project.test (array: u)
        }
    }
}

M1
//  RebindMemory.swift
//  UInt8ArrayToUInt16Array
//
//  Created by ibex on 19/3/2025.
//
// [https://forums.swift.org/t/how-to-convert-uint8-into-uint16/78624/2]

struct RebindMemory {
    static func test (array: [UInt8]) {
        Self (array: array).convert ()
    }
    
    // input
    let array: [UInt8]

    // the operation
    func doSomethingWithUInt16s (_ u: some RandomAccessCollection<UInt16>) {
        print (array)
        print (u.count, u)
        print ("first last:", u.first!, u.last!)
        print ("first last:", u [u.startIndex], u [u.index (before: u.endIndex)])
    }

    func convert () {
        array.withUnsafeBufferPointer { buffer in
            buffer.withMemoryRebound (to: UInt16.self) { buffer in
                doSomethingWithUInt16s (buffer)
            }
        }
    }
}
M2
//  Project.swift
//  UInt8ArrayToUInt16Array
//
//  Created by ibex on 19/3/2025.
//

// [https://forums.swift.org/t/how-to-convert-uint8-into-uint16/78624/3]

struct Project {
    static func test (array: [UInt8]) {
        print (array)
        
        let u = array.projectingToUInt16()
        print (u.count, u)
        print ("first last:", u.first!, u.last!)
        print ("first last:", u [u.startIndex], u [u.index (before: u.endIndex)])
    }
}

struct ProjectingToUInt16Collection<Base>: RandomAccessCollection where Base: RandomAccessCollection, Base.Element == UInt8 {
    private var base: Base

    fileprivate init(wrapping base: Base) {
        precondition(base.count % 2 == 0)
        self.base = base
    }

    var startIndex: Base.Index {
        self.base.startIndex
    }

    var endIndex: Base.Index {
        self.base.endIndex
    }

    func index(after original: Base.Index) -> Base.Index {
        self.base.index(original, offsetBy: 2)
    }

    func index(before original: Base.Index) -> Base.Index {
        self.base.index(original, offsetBy: -2)
    }

    func index(_ original: Base.Index, offsetBy distance: Int) -> Base.Index {
        return self.base.index(original, offsetBy: distance * 2)
    }

    func distance(from start: Base.Index, to end: Base.Index) -> Int {
        return self.base.distance(from: start, to: end) / 2
    }

    subscript(index: Base.Index) -> UInt16 {
        let lowNibble = self.base.index(after: index)
        return UInt16(self.base[index]) << 8 | UInt16(self.base[lowNibble])
    }
}

extension RandomAccessCollection where Element == UInt8 {
    func projectingToUInt16() -> ProjectingToUInt16Collection<Self> {
        .init(wrapping: self)
    }
}

To get them to yield the same values, I had to do some byte swapping in M1.

struct RebindMemory {
    static func test (array: [UInt8]) {
        Self (array: array).convert ()
    }
    
    // input
    let array: [UInt8]

    // the operation
    func doSomethingWithUInt16s (_ u: some RandomAccessCollection<UInt16>) {
        print (array)
        print (u.count, u)
        
        // Swaps the two bytes of a UInt16 (same result as the standard library's byteSwapped).
        func swapBytes (of u: UInt16) -> UInt16 {
            let u0: UInt16 = u & 0x00FF
            let u1: UInt16 = u & 0xFF00
            return u0 << 8 | u1 >> 8
        }
        
        print ("first last:", swapBytes (of: u.first!), swapBytes (of: u.last!))
        print ("first last:", swapBytes (of: u [u.startIndex]), swapBytes (of: u [u.index (before: u.endIndex)]))
    }

    func convert () {
        array.withUnsafeBufferPointer { buffer in
            buffer.withMemoryRebound (to: UInt16.self) { buffer in
                doSomethingWithUInt16s (buffer)
            }
        }
    }
}
[1, 2, 3, 4]
2 UnsafeBufferPointer(start: 0x0000600002cb85f0, count: 2)
first last: 258 772
first last: 258 772

[1, 2, 3, 4]
2 ProjectingToUInt16Collection<Array<UInt8>>(base: [1, 2, 3, 4])
first last: 258 772
first last: 258 772
Program ended with exit code: 0

Yes, that's why I said

Assuming you're certain about the endianness:

On little-endian machines, like modern Apple machines, interpreting a (a: UInt8, b: UInt8) sequence directly as a UInt16 creates the value b << 8 + a. This may or may not match the intrinsic endianness of the input stream.


I agree we should not assume alignment of the underlying buffer. However, I'm not sure about the safety problem on Array.

Correct me if I'm wrong, but I think there's a difference between Array and Data. Array has the withUnsafeBufferPointer method, which according to its documentation always vends an UnsafeBufferPointer explicitly bound to the Element type. Data, on the other hand, now only has methods that vend an UnsafeRawBufferPointer.

I don’t understand your question. Whether you have a buffer of typed UInt8 or a blob of bytes, if you want to interpret or reinterpret it as a buffer of typed UInt16 you have to think about alignment. Array’s alignment comes from the implementation details that are part of the stable stdlib ABI on Apple platforms and unlikely to change in practice on other platforms, but it’s not a generalizable principle. I just didn’t want to say “there’s a problem here!” and have someone go off and then be very confused why they can’t run into it in practice.


Thanks for your reply, my question is simple:

If some standard library function provides me an UnsafeBufferPointer<T> where T is a POD, can I always assume this buffer can be safely reinterpreted as an UnsafeBufferPointer<S> when both of the following conditions are true?

    1. S is also a POD.
    2. MemoryLayout<T> and MemoryLayout<S> are compatible with each other.

If this is true, then it does not matter how an UnsafeBufferPointer<T> is retrieved, as long as it comes directly from a stable public API. Am I right?

I know it's tempting to manually swap bytes, but this way lies architecture-dependent code. The integer types in the standard library have littleEndian and bigEndian properties which do the right thing regardless of your architecture. This allows you to write your doSomethingWithUInt16s portably.

On a little endian machine:

  1> UInt16(1).littleEndian
$R0: UInt16 = 1
  2> UInt16(1).bigEndian
$R1: UInt16 = 256
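
Applied to the rebinding code from earlier in the thread, that could look roughly like the sketch below. It assumes the bytes are big-endian and, like the original pointer code, that the array's storage is suitably aligned for UInt16; the function name bigEndianFirstAndLast is made up for illustration.

```swift
// Hypothetical helper: reads the first and last UInt16 from a big-endian byte array.
// UInt16(bigEndian:) swaps bytes only when the host is little-endian,
// so the result is the same on every architecture.
func bigEndianFirstAndLast(of array: [UInt8]) -> (UInt16, UInt16) {
    precondition(array.count >= 2 && array.count.isMultiple(of: 2))
    return array.withUnsafeBufferPointer { buffer in
        buffer.withMemoryRebound(to: UInt16.self) { words in
            (UInt16(bigEndian: words.first!), UInt16(bigEndian: words.last!))
        }
    }
}
```

For the input [1, 2, 3, 4] this yields 258 and 772, matching the projecting collection, without any hand-written swapping.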

I know, but thanks for reminding me.

I did that because I was trying to reconcile @CrystDragon's example code to @lukasa's example code:

subscript(index: Base.Index) -> UInt16 {
    let lowNibble = self.base.index(after: index)
    return UInt16(self.base[index]) << 8 | UInt16(self.base[lowNibble])
}

After taking your advice into consideration, @lukasa's example code becomes (portable):

subscript(index: Base.Index) -> UInt16 {
    let lowNibble = self.base.index(after: index)

    if 3 == 3.bigEndian {
        return UInt16(self.base[index]) << 8 | UInt16(self.base[lowNibble])
    }
    else {
        return UInt16(self.base[index]) | UInt16(self.base[lowNibble]) << 8
    }
}

Sort of. Types that are BitwiseCopyable can be safely read into and out of a buffer, but you can't safely do that from arbitrary bytes unless the type is also fully inhabited, which we don't have a layout constraint for (yet?).

For instance, Bool is BitwiseCopyable, but not fully inhabited. So you can memcpy a Bool value into memory, and then memcpy it back into a Bool later, but if you memcpy a 3 into a Bool it's instant UB and nasal demons can appear.
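
A safe alternative for types like Bool is to validate the byte and construct the value explicitly instead of reinterpreting memory. A minimal sketch, assuming the encoding uses 0 for false and 1 for true; decodeBool is a made-up name:

```swift
// Hypothetical helper: turns a raw byte into a Bool only when the byte
// is one of the two values a Bool can legally hold.
func decodeBool(from byte: UInt8) -> Bool? {
    switch byte {
    case 0: return false
    case 1: return true
    default: return nil  // e.g. 3: reject instead of invoking undefined behavior
    }
}
```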


File format?

In my experience it is quite rare to write if bigEndian { … } else { … }.

Most of the time this whole 2 * UInt8 -> UInt16 dance is caused by a binary file format that asks us to interpret bytes from 100 to 140 as [Int16]. In this case the format specifies whether the numbers are stored as little- or big-endian.

With this you do not need to write the if bigEndian { … } else { … }; you can just hard-code the endianness directly in the decoding function. (Hopefully with a comment that links to the relevant section of the documentation/standard.)

For example the protobufs documentation states (emphasis mine):

How do you figure out that this is 150? First you drop the MSB from each byte, as this is just there to tell us whether we’ve reached the end of the number (as you can see, it’s set in the first byte as there is more than one byte in the varint). These 7-bit payloads are in little-endian order. Convert to big-endian order, concatenate, and interpret as an unsigned 64-bit integer:

10010110 00000001        // Original inputs.
 0010110  0000001        // Drop continuation bits.
 0000001  0010110        // Convert to big-endian.
   00000010010110        // Concatenate.
 128 + 16 + 4 + 2 = 150  // Interpret as an unsigned 64-bit integer.

When implementing the protobuf decoder you can just hard-code the algorithm above. No need to check endianness.
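
As a rough illustration, the steps quoted above can be hard-coded like this. It is only a sketch, not the actual protobuf implementation, and decodeVarint is a made-up name:

```swift
// Hypothetical helper: decodes one varint from the start of `bytes`.
// Returns nil if the input ends before the varint does, or if it is too long.
func decodeVarint(_ bytes: [UInt8]) -> UInt64? {
    var result: UInt64 = 0
    var shift = 0
    for byte in bytes {
        guard shift < 64 else { return nil }      // more than 10 bytes: malformed
        // Drop the continuation bit; payloads arrive least-significant group first.
        result |= UInt64(byte & 0x7F) << shift
        if byte & 0x80 == 0 { return result }     // continuation bit clear: done
        shift += 7
    }
    return nil                                    // ran out of bytes mid-varint
}
```

The byte order is fixed by the format, so the code never has to ask what the host's endianness is.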

It may happen that your file format contains a flag that dictates the endianness. In such case you use the value supplied in the file.

The big question is: where did you get this data? What does it represent? What does the documentation/standard say?

Is [UInt8] aligned to UInt16?

Personally I always see such cases as: we have a 50% chance that [UInt8] is UInt16 aligned. 50% is not suitable for production. (This is a simplification, but I feel it is better than relying on some internal implementation details of Swift runtime.)

At the same time a single UInt8 in a [UInt8] is always UInt8 aligned. Thus I will always write the decoding step manually, possibly hard coding the endianness according to the spec.

Some may say that this is slower than the reinterpret-cast, but it is definitely safer. Performance can be improved later, possibly with the use of reinterpret-cast. This would be a separate ticket that requires additional research and possibly a deep dive into the Swift source code. This approach gives your manager more data on what you did during your working hours:

  • [IMPL-007] Implemented the file format - 24h
  • [IMPL-008] Improved [UInt16] decoding performance - 12h

Final

+1 on the wrapper suggested by @lukasa. Personally I would simplify it to just the bare essentials:

  • remove generic
  • remove RandomAccessCollection

Unless you actually need those features.

struct UInt16View {

  private let bytes: [UInt8] // Foundation.Data/UnsafePointer whatever you have

  var count: Int { self.bytes.count / 2 }

  init(bytes: [UInt8]) {
    // Do not use '%'. Swift has 'isMultiple' for exactly this reason.
    precondition(bytes.count.isMultiple(of: 2), "…")
    self.bytes = bytes
  }

  subscript(index: Int) -> UInt16 {
    // Implement based on the standard/documentation.
    let high = self.bytes[2 * index]
    let low = self.bytes[2 * index + 1]
    return UInt16(high) << 8 | UInt16(low)
  }
}

If the endianness is supplied in the file and known only at runtime, then just provide it in init and store as a property:

init(bytes: [UInt8], endian: Endian) { … }
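
The thread does not define Endian, so the following is just one way it might be filled in. Both the Endian enum and the EndianAwareUInt16View name are assumptions made for illustration:

```swift
// Hypothetical: byte order as read from a flag in the file at runtime.
enum Endian {
    case little
    case big
}

// Hypothetical variant of UInt16View that stores the byte order.
struct EndianAwareUInt16View {
    private let bytes: [UInt8]
    private let endian: Endian

    var count: Int { bytes.count / 2 }

    init(bytes: [UInt8], endian: Endian) {
        precondition(bytes.count.isMultiple(of: 2), "byte count must be even")
        self.bytes = bytes
        self.endian = endian
    }

    subscript(index: Int) -> UInt16 {
        let first = UInt16(bytes[2 * index])
        let second = UInt16(bytes[2 * index + 1])
        switch endian {
        case .big: return first << 8 | second      // first byte is the high byte
        case .little: return second << 8 | first   // first byte is the low byte
        }
    }
}
```

With the bytes [1, 2, 3, 4], the big-endian view yields 258 and 772, and the little-endian view yields 513 and 1027, the same two pairs of numbers seen earlier in the thread.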