Bitfields in swift seem really unsafe

oftentimes in programming we define types that contain smaller types. here’s an example of a small (4 byte) type:

/// A zero-indexed grid position within a source file. This type
/// can represent positions in files up to one million lines long,
/// and 4096 characters wide.
@frozen public
struct SourcePosition:RawRepresentable, Equatable, Hashable, Sendable
{
    /// Contains the line number in the upper 20 bits, and
    /// the column index in the lower 12 bits.
    public
    let rawValue:Int32
}

and here’s an example of a bigger type that contains a field of the smaller type:

@frozen public
struct SourceLocation<File>
{
    public
    let position:SourcePosition
    public
    let file:File
}

when we pass instances of the bigger type to other swift code, we usually do not think about layout; the compiler knows to load self.position at offset +0 and self.file at offset +4 (alignment notwithstanding).

but of course, when we serialize the bigger type to binary storage that doesn’t understand the swift ABI (e.g. BSON), we do need to think about layout, and we need to specify what SourceLocation<Int32> looks like when encoded to a raw type like Int64:

extension SourceLocation<Int32>:RawRepresentable
{
    @inlinable public
    var rawValue:Int64
    {
        .init(self.file) << 32 | .init(self.position.rawValue)
    }
    @inlinable public
    init(rawValue:Int64)
    {
        self.init(
            position: .init(rawValue: .init(truncatingIfNeeded: rawValue)),
            file: .init(truncatingIfNeeded: rawValue >> 32))
    }
}

can you see the bug?

answer
    var rawValue:Int64
    {
        .init(self.file) << 32 | .init(self.position.rawValue)
//                                     ^~~~~~~~~~~~~~~~~~~~~~
//             UInt32.init(bitPattern: self.position.rawValue)
    }
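to make the failure mode concrete, here’s a standalone sketch (the values are made up) showing how the sign extension clobbers the upper field:

```swift
let position = Int32(bitPattern: 0x8000_0001) // column data with the high bit set
let file: Int32 = 7

// buggy: Int64.init(_:) sign-extends the negative Int32, smearing 1-bits
// across the upper 32 bits and destroying the file field
let buggy = Int64(file) << 32 | Int64(position)

// fixed: widen through UInt32 so the conversion zero-extends instead
let fixed = Int64(file) << 32 | Int64(UInt32(bitPattern: position))

print(Int32(truncatingIfNeeded: fixed >> 32)) // 7
print(Int32(truncatingIfNeeded: buggy >> 32)) // -1: the file field is gone
```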

anyway, my point is writing this kind of “pack-and-crack” raw value logic is very error-prone, and i feel that our current toolbox of FixedWidthInteger API leaves a lot of legos on the floor to step on. because even if you get it right the first time, it is rather fragile, and the type system does not help you if you, for example, switch between Int32 and UInt32, or Int32 and Int16, etc.

can we get some standardized API for this stuff, maybe

infix operator .. : AdditionPrecedence
postfix operator ..

extension SourceLocation<Int32>:RawRepresentable
{
    @inlinable public
    var rawValue:Int64
    {
        self.file .. self.position.rawValue
    }
    @inlinable public
    init(rawValue:Int64)
    {
        let (file, position):(Int32, Int32) = (rawValue)..
        self.init(position: position, file: file)
    }
}

?
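for illustration, a minimal sketch of what one overload of such an operator could look like (the `..` spelling is from the proposal above; the `crack` function stands in for the postfix operator, which is harder to sketch generically):

```swift
infix operator .. : AdditionPrecedence

// hypothetical pack: high word in the upper 32 bits, low word zero-extended
func .. (high: Int32, low: Int32) -> Int64 {
    Int64(high) << 32 | Int64(UInt32(bitPattern: low))
}

// hypothetical unpack, written as a named function instead of postfix `..`
func crack(_ value: Int64) -> (high: Int32, low: Int32) {
    (Int32(truncatingIfNeeded: value >> 32),
     Int32(truncatingIfNeeded: value))
}

let packed: Int64 = Int32(7) .. Int32(-1)   // 0x0000_0007_FFFF_FFFF
let (high, low) = crack(packed)             // (7, -1)
```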


Well, this isn't unsafe - just error-prone. But I think it could be made less error-prone by improving how you model serialising/deserialising this data.

Firstly, you're using integers (the raw value in SourceLocation) not as numerical values, but as fixed-size buffers of bytes. I don't think it is appropriate to use signed integers for that, but it's also going to be awkward in lots of other ways because the integer API is geared towards dealing with numbers.

So I would suggest that, as an alternative, you should create an encode function and decoding initialiser which accept a buffer of bytes from an external context. Perhaps something like this:

protocol BinaryEncodable {
  // We should only append to the stream - AppendableCollection?
  func encode(to stream: inout some RangeReplaceableCollection<UInt8>)
}

protocol BinaryDecodable {
  init(decoding: some Collection<UInt8>)
}

If you encode into an Array, you will benefit from its generous capacity-growth policy, which ensures that appending data is performed in amortised constant time.


Now, since you're encoding into an integer, I'm guessing that performance is critical. If you wanted to optimise the encoding, I would recommend adding a requirement to calculate the required capacity up front:

protocol BinaryEncodable {
  func encode(to stream: inout some RangeReplaceableCollection<UInt8>)

  var requiredEncodingCapacity: Int? { get }
}

// Make it optional, if you like.
extension BinaryEncodable {
  var requiredEncodingCapacity: Int? { nil }
}

extension FixedWidthInteger where Self: BinaryEncodable {

  func encode(to stream: inout some RangeReplaceableCollection<UInt8>) {
    precondition(_isPOD(Self.self))
    withUnsafeBytes(of: self) { stream.append(contentsOf: $0) }
  }

  var requiredEncodingCapacity: Int? { 
    Self.bitWidth / 8
  }
}

extension UInt32: BinaryEncodable {} // etc., for all stdlib integer types

Then you would compose it in the straightforward way. This is a bit tedious -- I haven't had a chance to look at all the capabilities of Macros, but I believe this is precisely the kind of thing we wish to automate. So hopefully soon this will be a simple matter of adding @BinaryEncodable, but in the meantime:

struct SourceLocation: BinaryEncodable {
  var file: UInt32
  var position: UInt32
  
  func encode(to stream: inout some RangeReplaceableCollection<UInt8>) {
    file.encode(to: &stream)
    position.encode(to: &stream)
  }
  
  var requiredEncodingCapacity: Int? {
    guard let fileCapacity = file.requiredEncodingCapacity else { return nil }
    guard let positionCapacity = position.requiredEncodingCapacity else { return nil }
    return fileCapacity + positionCapacity
  }
}

Once you have this ability, a lot more options open up. The simplest (and recommended) approach would be to call reserveCapacity on your Array, and eliminate dynamic capacity growth.
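Putting the pieces together, here is a self-contained sketch of that composition, including the up-front reserveCapacity call (the protocol and type names follow the snippets above; everything else is illustrative):

```swift
protocol BinaryEncodable {
    func encode(to stream: inout some RangeReplaceableCollection<UInt8>)
    var requiredEncodingCapacity: Int? { get }
}

extension FixedWidthInteger where Self: BinaryEncodable {
    func encode(to stream: inout some RangeReplaceableCollection<UInt8>) {
        // Host-endian bytes; a real serialiser would pick a fixed endianness.
        withUnsafeBytes(of: self) { stream.append(contentsOf: $0) }
    }
    var requiredEncodingCapacity: Int? { Self.bitWidth / 8 }
}

extension UInt32: BinaryEncodable {}

struct SourceLocation: BinaryEncodable {
    var file: UInt32
    var position: UInt32

    func encode(to stream: inout some RangeReplaceableCollection<UInt8>) {
        file.encode(to: &stream)
        position.encode(to: &stream)
    }
    var requiredEncodingCapacity: Int? { 8 }
}

let location = SourceLocation(file: 1, position: 2)
var output: [UInt8] = []
if let capacity = location.requiredEncodingCapacity {
    output.reserveCapacity(capacity) // one allocation, no dynamic growth
}
location.encode(to: &output)
// output now holds 8 bytes: 4 for 'file' followed by 4 for 'position'
```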

If you wanted to go even further than that, you could use withUnsafeTemporaryAllocation to more aggressively stack-allocate the buffer (maybe you know you're often dealing with small values with limited lifetimes). To make this work, I would recommend:

  • Limiting the encode function to a new ByteOutputStream protocol, with a single function - append
  • Wrap the temporary allocation in a struct which implements this protocol
  • Add a precondition in append which ensures you never over-write the buffer's capacity, making the function memory-safe. If this precondition fails, it indicates that somebody implemented requiredEncodingCapacity incorrectly.

This would give you all the benefits of using integers, but in a more scalable, less error-prone way.


actually, using integers has nothing to do with performance, i am using integers because they are BSON primitives (similar to how “Decimal”, Bool, String, etc. are JSON primitives).

in BSON, we have a fixed set of choices we can use to encode things:

type       | layout
byte       | 1 byte (8 bits)
int32      | 4 bytes (32-bit signed integer, two's complement)
int64      | 8 bytes (64-bit signed integer, two's complement)
uint64[1]  | 8 bytes (64-bit unsigned integer)
double     | 8 bytes (64-bit IEEE 754-2008 binary floating point)
decimal128 | 16 bytes (128-bit IEEE 754-2008 decimal floating point)

there are actually a few more, BSON.Identifier can be used as a hacky UInt96, BSON.UTF8 can be used to store UTF-8 data etc. but you get the idea.

now, BSON, or at least my implementation of it, actually does get encoded by writing to a stream like you illustrated, but this is abstracted[2] away by a Codable-style interface that looks kind of like KeyedEncodingContainer - it exposes API that can encode primitives directly or delegate to another type that knows how to do the encoding.

it is essentially this delegation that i’m trying to accomplish: SourceLocation<Int32> doesn’t know how to encode itself to BSON, but Int64 does (because it is a BSON primitive), hence the RawRepresentable<Int64> conformance.

anyway, hope that clears up the confusion.


[1] UInt64 is formally part of the BSON spec, but it’s not very well-supported; MongoDB will not round-trip them depending on where they appear within a document, so Int32 and Int64 are really the only “safe” codewords.

[2] the rationale for not exposing the output stream is because writing things - even integers - can be complex, for example BSON always serializes integers in little-endian order regardless of host endianness. the BSON stream also needs to bracket values with type metadata among other things.


Okay, but then why are you not storing the SourceLocation type using two 32-bit primitives? Why pack them in a single primitive? Once you do that, you leave the realm of integer numbers and enter the realm of fixed-size byte buffers.

Regardless, I think it is clear that the language can express these concepts. It may be that BSON/MongoDB/whatever can be awkward, and it sounds like it will be difficult to model that, but I don't see an indication of some gap in the language.

One thing I can suggest is that, if you want to do this sort of packing, perhaps it will help if you separate the concept of binary serialisation from the BSON primitives. Perhaps have a separate BSONSerialisable protocol, and when the destination type is an Int64, you can have a convenience method which encodes to the Int64:

// Or whatever. Maybe there's no refinement and these are just totally separate protocols.
// Doesn't make much of a difference.
protocol BSONEncodable: BinaryEncodable {
  associatedtype BSONOutput
}

extension BSONEncodable where BSONOutput == Int64 {
  var asInt64: Int64 {
    precondition(requiredEncodingCapacity == 8)
    var x = Int64.zero
    withUnsafeMutableBytes(of: &x) {
      var buf = PreallocatedBufferStream(buffer: $0) // Safe wrapper around UMRBP.
      encode(to: &buf)
    }
    return x
  }
}
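PreallocatedBufferStream isn't shown here; a minimal sketch of what it might look like, assuming the single-append ByteOutputStream shape suggested earlier (rather than RangeReplaceableCollection):

```swift
// Hypothetical wrapper around a caller-provided, fixed-size raw buffer.
struct PreallocatedBufferStream {
    private let buffer: UnsafeMutableRawBufferPointer
    private var offset = 0

    init(buffer: UnsafeMutableRawBufferPointer) { self.buffer = buffer }

    // Traps instead of over-writing the buffer's capacity; a failure here
    // means someone implemented requiredEncodingCapacity incorrectly.
    mutating func append(contentsOf bytes: UnsafeRawBufferPointer) {
        precondition(offset + bytes.count <= buffer.count,
                     "over-wrote the buffer's capacity")
        UnsafeMutableRawBufferPointer(rebasing: buffer[offset...])
            .copyMemory(from: bytes)
        offset += bytes.count
    }
}

var x = Int64.zero
withUnsafeMutableBytes(of: &x) { raw in
    var stream = PreallocatedBufferStream(buffer: raw)
    withUnsafeBytes(of: UInt32(1)) { stream.append(contentsOf: $0) }
    withUnsafeBytes(of: UInt32(2)) { stream.append(contentsOf: $0) }
}
// on a little-endian host, x is now 0x0000_0002_0000_0001
```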

I tested this very quickly and it all seems to work and compose in a straightforward way (and with macros, it should be even easier). Again, I don't doubt that a real, production implementation of BSON will have to deal with more complex realities, but I think from the Swift language side things work pretty well :man_shrugging:


BSON does not have the same concept of type metadata as swift, to encode two Int32 primitives (a, b) requires encoding the dictionary ["0": a, "1": b]. (seriously!)

this is hilariously space-inefficient: it requires an overhead of 3 bytes per element, plus a constant 5 bytes to represent the dictionary container. which means an 8-byte value like (Int32, Int32) now requires 19 bytes to encode, a 138% increase!

this looks to me like RawRepresentable.RawValue delegation with different spellings.

just so we’re on the same page, the BSONEncodable protocol looks like:

public
protocol BSONEncodable
{
    func encode(to field:inout BSON.Field)
//                             ^~~~~~~~~~
//             this is the output “stream”
}

BSON.Field streams are truly monstrous things to work with, so there’s a RawRepresentable-based shim that lets you opt-in to defaulted conformance if RawValue:BSONEncodable:

extension BSONEncodable where Self:RawRepresentable, RawValue:BSONEncodable
{
    /// Delegates to the ``encode(to:)`` witness of this type’s
    /// ``RawRepresentable.rawValue``.
    @inlinable public
    func encode(to field:inout BSON.Field)
    {
        self.rawValue.encode(to: &field)
    }
}

so i am having a hard time seeing how

is an improvement over

Sure, you can use RawRepresentable to fill the same role, if you like. The thing I am attempting to illustrate is that you can separate binary serialisation (which is the thing you originally suggested was error-prone, IIUC) from all of these BSON quirks.

I appreciate that BSON may be very awkward to work with, implementations may be buggy or restrictive, etc. That's not a discussion I (or many in this community) can contribute to; but I can contribute towards a discussion about what the Swift language can do to make binary serialisation (not specific to BSON or its quirks) easier.

And that is how the extension I suggested is an improvement. You complained that it was error-prone to write this sort of code, and that:

By separating these aspects, the encoding code no longer deals in FixedWidthInteger APIs at all; it doesn't care in the slightest that you're storing this as an Int64 because that's a "BSON primitive". You could encode it into an Array, or as part of a larger data structure, and everything would continue to work, because it is now expressing its layout in terms of storing to/loading from byte buffers, which is truly the thing you're trying to express when packing the data in this way.

And that's why I think this is really a question of you having to look for the best (or "least bad") model rather than an indication of a deficiency in the language.

(And FWIW, the compiler does a really good job of optimising this code. With the extension, it compiles down to the same code as you wrote - just a shift and an OR, even though the encoding is working at this more abstract level)

i might be missing something, but it just does not make a lot of sense to me to treat Int64 as a streaming output. you can’t, for example, just write one Int32 to an Int64 and return, nor can you write three Int32 values. you always have to populate the value with exactly 8 bytes of data, which is a really weird “stream”.

you cannot encode BSON arrays by repeatedly calling a write(_:) method on the stream as you would for a single element, a BSON array is simply a dictionary with the keys ["0": x[0], "1": x[1], ... , "\(x.count - 1)": x[x.count - 1]].

what this means in practice is that two BSON values never occur adjacent to each other in the binary output stream, there is always some interceding metadata. and so user code never really has any reason to interface with the output stream directly, only the container-encoder talks to the output stream.

What you may be missing is that Int64 is not the streaming output. The streaming output is generic; it can also be an Array, or a pointer, or whatever else. The streaming output is "some byte stream".

But if you have additional information that a type will produce 8 bytes (or less than 8 bytes, if you want to support that), you could write a convenient extension using a stream which is backed by an 8-byte integer. If you don't find that useful then don't use it, but I thought it would be helpful given what you have explained.

The point is that the encoding happens in terms of byte streams. How you map that to BSON is a separate problem, and is part of your creative freedom as a library author. I'm not seeing a language or standard library deficiency so far.

well, there are two types that participate in encoding, there is the Encoder-conforming type, which does perform encoding in terms of byte streams, and there is the Encodable-conforming type, which knows nothing about byte streams, it encodes by calling API on the encoder instance.

i am not asking about the (BSON)Encoder type (BSON.Field). there are only a handful of encoders, all of which ship with the BSON library and are extensively tested. i am asking about the variety of (BSON)Encodable types, which don’t interact with byte streams, and i feel, should not interact with byte streams - that is the encoder’s job.

Getting back to the original idea, there are definitely ways to make bitfields safer…but I don’t think it makes sense to focus on concatenation of quantities that are already aligned to bytes. I hacked together a prototype a while back for something a little more C-like, but I suspect with macros you could do a lot better (thinking about the DictionaryStorage example). As long as the bit-level manipulation is abstracted away (in a function or in a macro), the chance for misuse goes way way down. Cause yeah, thus far Swift has not come with any particular support for bitfield patterns, and you’ve had to roll your own (like the original code).


Not if you want those encodable types to do their own packing. Packing is intrinsically a byte stream operation. You're already doing it manually using shifts and manual widening/narrowing operations, which as you've discovered are not the best tools for the job.

And I'm suggesting that instead, the Encodable types can use a more abstract interface to encode their packed data as a byte stream, completely avoiding any Integer APIs, but while still surfacing convenient packed integers to your Encoder types.

Check out the simple example that I linked to. I don't write a single shift operation, and the implementation of the Encodable type is super-simple, just forwarding to the implementation of encode in each stored property. Yet it compiles down to the same code as you wrote when doing manual shifts and ORs. That is how you make the thing less error-prone.


i think the easiest way to explain why this doesn’t make sense (for BSON) is to point out that BSON is dynamically-typed.

so if you have data that looks like:

let _:(x:Bool, y:Int32, z:Int64) = 
(
    true,
    0xAAAA_AAAA,
    0xBBBB_BBBB_CCCC_CCCC,
)

it corresponds to BSON that can be thought of (in “swift terms”) as:

let _:[Any] = 
[
    Bool.self,
    "x",
    true,

    Int32.self,
    "y",
    0xAAAA_AAAA,

    Int64.self,
    "z",
    0xBBBB_BBBB_CCCC_CCCC,
]

if you instead try to encode the Int64 in pieces, such as:

let _:(x:Bool, y:Int32, z:Int32, _:Int32) = 
(
    true,
    0xAAAA_AAAA,
    0xCCCC_CCCC,
    0xBBBB_BBBB,
)

you don’t get the same BSON, you get:

let _:[Any] = 
[
    Bool.self,
    "x",
    true,

    Int32.self,
    "y",
    0xAAAA_AAAA,

    Int32.self,
    "z",
    0xCCCC_CCCC,

    Int32.self,
    "",
    0xBBBB_BBBB,
]

it’s essentially the same reason why this makes sense:

let x:(Int32, Int32) = (1, 2)
let _:Int64 = .init(littleEndian: unsafeBitCast(x, to: Int64.self))

but this doesn’t:

let x:(Any, Any) = (1 as Int32, 2 as Int32)
let _:Int64 = .init(littleEndian: unsafeBitCast(x, to: Int64.self))

does that help?

I think the best approach here is to separate encoding a value into a particular format and encoding a value into a particular byte representation, and then combine these two mechanisms.

Step 1: Implement a robust and error-tolerant way to represent one or more values as a byte stream of some length. I personally like @Karl's way of abstracting this, but in my opinion, the encoder has too little control over the way the encoding goes, so implementing things like stream multiplexing and chunking could be quite challenging. But anyway, we are not discussing a general-purpose API for packing arbitrary data into byte streams here. You can come up with your own solution for such an abstraction. The main goal here would be to move the "byte-concatenating" logic out of business types.
This will result in something like:

extension SourceLocation where File==Int32 {
  var asUInt64: UInt64 {
    var data = 0 as UInt64
    // some cool and abstract way to pack two int32 fields into 8 bytes of `data`
    return data
  }
}
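one concrete way to fill in that sketch, assuming the file goes in the upper 32 bits and the position in the lower 32 (the field order is an assumption):

```swift
struct SourcePosition { let rawValue: Int32 }
struct SourceLocation<File> { let position: SourcePosition; let file: File }

extension SourceLocation where File == Int32 {
    // widen through UInt32 so negative fields zero-extend instead of
    // sign-extending into the other half
    var asUInt64: UInt64 {
        UInt64(UInt32(bitPattern: file)) << 32
            | UInt64(UInt32(bitPattern: position.rawValue))
    }
}

let location = SourceLocation<Int32>(position: SourcePosition(rawValue: -1), file: 3)
// location.asUInt64 == 0x0000_0003_FFFF_FFFF
```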

Step 2: If you use some Codable-based BSON serializer, just implement a custom encode(to:) function on SourceLocation:

extension SourceLocation: Encodable where File==Int32 {
  func encode(to encoder: Encoder) throws {
    var c = encoder.singleValueContainer()
    try c.encode(asUInt64)
  }
}

This approach will keep the BSON serialization and the representation of your struct in compact form independent of each other.

i’ve mostly got Step 2 figured out already, the crux of this thread is really about the

:slight_smile:

it seems like a lot of the discussion here has so far revolved around “streaming” bits to an Int64. to be honest, i am still unconvinced that streaming is the right paradigm to initialize a fixed-size value like an Int64, and i’m not sure how helpful it is to think of Int64 as some sort of 8-byte fixed-size array.

we are concatenating bits for sure, but i don’t know if that constitutes streaming in a helpful sense.

The only reason to think about Int64 is that BSON doesn't support arbitrarily sized raw data.
So you'd like to have a compile-time diagnostic that the size of the concatenated fields is equal to some expected value?

correct. one way to think of it is that BSON natively supports single-, double-, and triple-word values, where “word” in this context means a 32-bit value.

type                      | bit-width
Int32                     | 32
Int64                     | 64
BSON.Identifier (“Int96”) | 96

in my mind, we need a safe way to perform packs and unpacks, such as

Int16 ←→ (Int8, Int8)
Int32 ←→ (Int16, Int16)
Int64 ←→ (Int32, Int32)
Int96 ←→ (Int32, Int64)
Int96 ←→ (Int64, Int32)
Int96 ←→ (Int32, Int32, Int32)

it feels to me like there are an enormous number of possible overloads, which makes me wonder if a language-level solution is appropriate.

of course, we could instead rant all day about how stupid a format BSON is and how awful all the BSON libraries are and how much better off we would be with a new library/format. but the fact is BSON is here to stay, so we need a solution.


BSON actually *does* have a type for “arbitrary-length raw data”. but it has enough overhead (+5 bytes) that it is not worthwhile to use it as a replacement for the fixed-size integer types.

I haven't fully thought through the whole idea, but I think it's worth considering. We can express a fixed-size list of fields with:

struct Pair<L, R> {
  var l: L
  var r: R
}

Then we can introduce a protocol representing a value of compile-time known size.

protocol FixedBitWidthValue {
  static var bitWidth: Int { get }
}

extension Pair: FixedBitWidthValue where L: FixedBitWidthValue, R: FixedBitWidthValue {
  static var bitWidth: Int {
    L.bitWidth + R.bitWidth
  }
}

And lastly we can always end the list with

struct ZeroWidthValue: FixedBitWidthValue {
  static var bitWidth: Int { 0 }
}

This will give us a universal pack type.
Then we can introduce something like IteratorProtocol, but next will have 4 (or more) variants: nextUInt8(), nextUInt16(), nextUInt32(), nextUInt64().
And perhaps you could even use #assert to check if Pack.bitWidth == 64.
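a runnable sketch of the Pair/FixedBitWidthValue scheme above; since Swift has no general #assert today, the width check here runs as a startup precondition:

```swift
struct Pair<L, R> { var l: L; var r: R }

protocol FixedBitWidthValue { static var bitWidth: Int { get } }

// FixedWidthInteger already provides a static bitWidth, so the stdlib
// integer types conform with empty extensions.
extension Int32: FixedBitWidthValue {}
extension Int64: FixedBitWidthValue {}

extension Pair: FixedBitWidthValue where L: FixedBitWidthValue, R: FixedBitWidthValue {
    static var bitWidth: Int { L.bitWidth + R.bitWidth }
}

precondition(Pair<Int32, Int32>.bitWidth == 64) // fits a BSON int64
precondition(Pair<Int32, Int64>.bitWidth == 96) // fits an "Int96"
```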

There are 2 problems:

  • adding integers to form a bigger one:
    • UInt32 + Int16 + UInt16 = Int64
    • UInt32 + Int16 = Int64 (16 bits are unused).
  • packing integers with shift:
    /// Contains the line number in the upper 20 bits, and
    /// the column index in the lower 12 bits.
    public let rawValue:Int32
    

I know that you asked if Swift "as a language" can solve this problem, but I think that the solution does not call for this. Though it could make a good Swift package for "open source points".

Btw. every time I need to do any bit masks I immediately go for unsigned type. I don't know why, it is just a habit, for me it is simpler.

Problem 1: adding integers to form a bigger one

This is going to be a code dump. I added comments to explain what is going on. Code is untested, it is more of a sketch.

// Pack into 64 bit.
// Create such type for every bitWidth that you want to support (Pack32 etc.).
// You can use code generator.
struct Pack64 {
  typealias Storage = UInt64
  private var value: Storage = 0
  private var shift: Int = 0

  // Get packed value, this is what you put into BSON.
  var unsigned: UInt64 { self.value }
  var signed: Int64 { Int64(bitPattern: self.value) }

  // We need Int name in method name for symmetry with `Unpack`.
  // Add such method for every Int type that you want to be able to pack.
  mutating func packUInt32(_ n: UInt32) {
    let width = UInt32.bitWidth
    precondition(self.shift + width <= Storage.bitWidth, "Pack out of bounds")
    self.value = self.value | Storage(n) << self.shift
    self.shift += width
  }

  mutating func packInt32(_ n: Int32) {
    self.packUInt32(UInt32(bitPattern: n))
  }
}

struct Unpack64 {
  typealias Storage = UInt64
  private var value: Storage

  // Unpack from signed or unsigned.
  init(_ value: UInt64) { self.value = value }
  init(_ value: Int64) { self.value = UInt64(bitPattern: value) }

  mutating func unpackUInt32() -> UInt32 {
    let width = UInt32.bitWidth
    let mask: Storage = (1 << width) - 1
    let result = self.value & mask
    self.value = self.value >> width
    return UInt32(result)
  }

  mutating func unpackInt32() -> Int32 {
    let u = self.unpackUInt32()
    return Int32(bitPattern: u)
  }
}

struct SourceLocation {
  typealias BSONStorage = Int64

  // Line is signed just to show that we can deal with signed values.
  let line: Int32
  let column: UInt32

  var encode: BSONStorage {
    var p = Pack64()
    p.packInt32(self.line)
    p.packUInt32(self.column)
    return p.signed
  }

  static func decode(_ source: BSONStorage) -> SourceLocation {
    // Note the symmetry with 'encode':
    // - same field order
    // - similar method name (just add 'un' prefix)
    // Compiler will error if somebody changes the type of 'line' to 'UInt32'
    var p = Unpack64(source)
    let line = p.unpackInt32()
    let column = p.unpackUInt32()
    return SourceLocation(line: line, column: column)
  }
}

Quick test:

let loc = SourceLocation(line: -15, column: UInt32.max)
print(loc)
let bson = loc.encode
print(String(UInt64(bitPattern: bson), radix: 2))
let decoded = SourceLocation.decode(bson)
print(decoded)

Problem 2: packing integers with shift

Just add a width argument to the pack/unpack methods. Totally untested, but you should get it working.

struct Pack64 {
  // Allow the user to specify the width of the field.
  mutating func packUInt16(_ n: UInt16, width: Int = UInt16.bitWidth) {
    precondition(self.shift + width <= Storage.bitWidth, "Pack out of bounds")

    // '(1 << 16) - 1' would trap on UInt16, so special-case the full width.
    let mask: UInt16 = width == UInt16.bitWidth ? .max : (1 << width) - 1
    precondition(n & ~mask == 0, "Pack trimming bits")

    self.value = self.value | Storage(n) << self.shift
    self.shift += width
  }
}

struct SourceLocation {
  var encode: BSONStorage {
    // Pack64 so that 'signed' matches BSONStorage (Int64);
    // the 20 + 12 bit layout leaves the upper 32 bits unused.
    var p = Pack64()
    p.packInt32(self.line, width: 20)
    p.packUInt32(self.column, width: 12)
    return p.signed
  }
}

Given Swift’s guarantee that homogeneous tuples are layout-compatible with C arrays, this should be one place it’s fine to use unsafeBitCast, right? You have to be conscious of endianness if you’re using this serialization to send data between systems, but if you’re just using it to store data on the same server, it should be fine.
That probably won’t work for Int96, but since I don’t know how you’ve implemented that (as it’s not a standard library type), I can’t say for certain.
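A quick check of the tuple bit cast, assuming a little-endian host (true of current Apple, x86-64, and AArch64 targets), so the first tuple element lands in the low 32 bits:

```swift
// (Int32, Int32) and Int64 have the same size, so unsafeBitCast is legal;
// homogeneous tuples are laid out like C arrays.
let pair: (Int32, Int32) = (1, 2)
let packed = Int64(littleEndian: unsafeBitCast(pair, to: Int64.self))
// on a little-endian host: packed == 0x0000_0002_0000_0001
```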

I wonder if you'd encounter any issues with alignment if you use unsafeBitCast.

> MemoryLayout<(Int8, Int8)>.alignment
$R0: Int = 1
> MemoryLayout<Int16>.alignment 
$R1: Int = 2

I don't think the documentation makes any guarantees about that, so it may not be safe to assume anything. Does it always make a copy in to correctly-aligned storage and never optimise that out?

EDIT: Seems it's fine. Even though it's a value-level operation, I was concerned the compiler would assume the original value was correctly-aligned for the new type and optimise out copying the bits, but apparently it should be considered a bug if it does that.
