Bag-of-Bytes types in Swift

Before I share my experience and opinions on the matter, I want to refer to the forum post that sparked this one: "Why must i build ALL of NIOCore just to say the name ByteBuffer?" by @taylorswift.

After that post, I had a follow-up conversation with @glessard, and we discussed the need for more/different bag-of-bytes types in Swift. I didn't immediately get back to him, but I finally kicked off the discussion with him after @Tony_Parker's presentation at the Server-Side Swift conference and the related forum post: "What’s next for Foundation".

I commented that it's time to seriously talk about Data. I don't mean the specific Data type seen in Foundation; I mean the general way Swift developers interact with bytes. Because there's a whole set of use cases that impact the design, I've written out my experience in both open source and closed source development in Swift. Note that these are my experiences and opinions, and it's important to hear more perspectives. I neither think nor hope to have a one-size-fits-all solution for binary data.

Below is a mostly intact message I sent in my conversation with @glessard.


Over the last 7 years of Swift development, I've had a large variety of use cases for which I need(ed) a "bag of bytes" type. I think the best way to approach this is to first cover all the use cases that need "a" bag of bytes type, and then see how and if we can find an agreement between these requirements. For the sake of completeness, I'll list use cases even where today's types already cover them. This'll also reflect upon the wider ecosystem, including NIO. This could then evolve into either one or many types covering different needs. We also need slices of buffers that don't make copies, and copy-on-write semantics. Again, we're all good here.

The first, and most common, is protocol (de-)serialization. I need control over endianness, writer and reader indices, consuming and non-consuming reads. I also need control over (re)allocations sometimes. I think SwiftNIO's ByteBuffer is just amazing here, especially combined with ByteBufferAllocator.
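To make those requirements a bit more concrete, here's a minimal sketch of a reader with a reader index, explicit endianness, and both consuming and non-consuming reads. `ByteReader`, `peekInteger` and `readInteger` are hypothetical names for illustration, not SwiftNIO's API, and the sketch is backed by a plain `[UInt8]` rather than an allocator-aware buffer.

```swift
enum Endianness { case big, little }

/// Hypothetical reader over a byte array (illustration only, not NIO's ByteBuffer).
struct ByteReader {
    private let bytes: [UInt8]
    private(set) var readerIndex = 0

    init(_ bytes: [UInt8]) { self.bytes = bytes }

    var readableBytes: Int { bytes.count - readerIndex }

    /// Non-consuming read: decodes an integer at the reader index without moving it.
    func peekInteger<T: FixedWidthInteger>(
        _ type: T.Type = T.self,
        endianness: Endianness = .big
    ) -> T? {
        let size = T.bitWidth / 8
        guard readableBytes >= size else { return nil }
        var value: T = 0
        for offset in 0..<size {
            let byte = T(truncatingIfNeeded: bytes[readerIndex + offset])
            switch endianness {
            case .big: value = (value << 8) | byte
            case .little: value |= byte << (offset * 8)
            }
        }
        return value
    }

    /// Consuming read: same decoding, but advances the reader index on success.
    mutating func readInteger<T: FixedWidthInteger>(
        _ type: T.Type = T.self,
        endianness: Endianness = .big
    ) -> T? {
        guard let value = peekInteger(type, endianness: endianness) else { return nil }
        readerIndex += T.bitWidth / 8
        return value
    }
}
```

A caller would typically peek at a length or type field first, and only consume once the whole message is known to be available.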

Second use case is for even lower-level systems. In some of my projects I need tight control over memory (de)allocations, and because we're working with Apple's APIs in these projects, we actually are running into a couple tough situations. First of all, iOS can have tight memory limits, especially in app extensions. We don't want to allocate new buffers for each message we transfer, so we'll need something like a memory arena. In addition, we want to apply back pressure if a buffer is not yet available. We've built our own types for this right now. The main 'issue' here initially was that Apple's APIs use Data. So we ended up wrapping our arena's pointers in bytesNoCopy with a custom deallocator function that returns the region to the arena.
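As a rough illustration of the bytesNoCopy approach described above, here's a minimal sketch. `BufferArena`, `makeData` and the fixed region size are hypothetical simplifications, not our actual implementation. `Data(bytesNoCopy:count:deallocator:)` with a `.custom` deallocator is the Foundation API in question. Returning `nil` when the arena is empty stands in for a real back-pressure mechanism.

```swift
import Foundation

/// Hypothetical fixed-size arena that lends its regions out as `Data` values.
final class BufferArena: @unchecked Sendable {
    let regionSize: Int
    private var free: [UnsafeMutableRawPointer]
    private let lock = NSLock()

    init(regionCount: Int, regionSize: Int) {
        self.regionSize = regionSize
        self.free = (0..<regionCount).map { _ in
            UnsafeMutableRawPointer.allocate(byteCount: regionSize, alignment: 16)
        }
    }

    deinit {
        // Only regions that were returned to the arena are still ours to free.
        free.forEach { $0.deallocate() }
    }

    var availableRegions: Int {
        lock.lock(); defer { lock.unlock() }
        return free.count
    }

    /// Wraps a free region in `Data` without copying.
    /// Returns nil when no region is free, so the caller can apply back pressure.
    func makeData(count: Int) -> Data? {
        precondition(count <= regionSize, "message does not fit in a region")
        lock.lock()
        let candidate = free.popLast()
        lock.unlock()
        guard let region = candidate else { return nil }
        // The custom deallocator hands the region back instead of freeing it.
        return Data(bytesNoCopy: region, count: count, deallocator: .custom({ pointer, _ in
            self.release(pointer)
        }))
    }

    private func release(_ pointer: UnsafeMutableRawPointer) {
        lock.lock(); defer { lock.unlock() }
        free.append(pointer)
    }
}
```

The deallocator closure keeps the arena alive for as long as any `Data` still references one of its regions, which is one possible ownership choice among several.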

Next up, when working with iOS, you'll work with (NS)Data more often than not right now. But assuming we don't need to take that into consideration, it's fundamental that we can, at least to some degree, use contiguous binary data types interchangeably. This is more of an API design thing than anything else, but I do think it's essential if you want to support various different use cases without making one or more copies of your data.

Another use case, more so on the server, is static file serving. Ideally you'd stay within kernel-space, something you can do with NIO's IOData/FileRegion. Regardless, I think this use case is not to be forgotten, as it's a huge boost when your TCP stack is not in user space.

In many of my libraries, end users receive their data over an HTTP(2), database or other TCP-based connection. In these cases, we're working with ByteBuffer. One or more TCP packets can form a single application layer message, like an HTTP body. This can be seen as a contiguous blob, mainly for databases, or as a stream, which is ideal for HTTP bodies. Regardless, I think it's important to be able to accumulate information in a buffer with a corresponding capacity, since many times you'll know the final payload size before the arbitrarily sized payload itself starts. Having this contiguous blob with a predefined capacity prevents (re)allocs. We could then reuse this message buffer for the next message in line.
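A minimal sketch of that accumulate-then-reuse pattern, using a plain `[UInt8]` for brevity. `MessageAccumulator` and its methods are hypothetical names. The only standard library behaviour relied upon is `reserveCapacity(_:)` and `removeAll(keepingCapacity:)`.

```swift
/// Hypothetical accumulator that collects chunks into one contiguous buffer
/// and keeps its allocation around for the next message.
struct MessageAccumulator {
    private(set) var storage: [UInt8] = []
    private var expectedSize = 0

    init(maximumMessageSize: Int) {
        // Allocate once up front, based on the protocol's message size limit.
        storage.reserveCapacity(maximumMessageSize)
    }

    /// Starts a new message whose size is known from its header.
    mutating func begin(expectedSize: Int) {
        storage.removeAll(keepingCapacity: true) // reuse the existing allocation
        storage.reserveCapacity(expectedSize)    // grows only if the message is larger
        self.expectedSize = expectedSize
    }

    /// Appends one chunk and reports whether the message is now complete.
    mutating func append<Chunk: Collection>(_ chunk: Chunk) -> Bool
    where Chunk.Element == UInt8 {
        storage.append(contentsOf: chunk)
        return storage.count >= expectedSize
    }
}
```

As long as nothing else holds on to `storage`, beginning the next message reuses the same allocation instead of making a new one.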

Likewise, many protocols have a hard limit on the size of a message. I think it would be great if we could reuse buffers for that. Often enough, a message is processed directly into, say, a data structure of the user's or library's choice. Many times this'll happen through Codable. Many structured data formats, including MongoDB BSON and MySQL, will make slices of data and pass them around a Decoder as well. Decoding for these formats tends to use non-consuming reads, though parsing strategies will vary. BSON specifically is a type that contains itself recursively. The main type (Document) closely resembles a JSON Object type, and can contain another Document, just like JSON Objects can contain other objects.
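To illustrate the recursive, slice-based decoding described above, here's a sketch over a made-up toy format (explicitly not BSON): tag `0x01` is followed by a one-byte value, and tag `0x02` is followed by a one-byte length and that many bytes of nested document. `ToyDocument` is a hypothetical type. The reads are non-consuming in that they only move a local index, and nested documents are `ArraySlice` views onto the same storage rather than copies.

```swift
/// Hypothetical recursive document over a slice of the original message.
struct ToyDocument {
    let bytes: ArraySlice<UInt8>

    /// Counts scalar values in this document and all nested documents.
    /// Returns nil if the bytes are malformed or truncated.
    func valueCount() -> Int? {
        var index = bytes.startIndex
        var count = 0
        while index < bytes.endIndex {
            let tag = bytes[index]
            index += 1
            switch tag {
            case 0x01: // scalar: one value byte follows
                guard index < bytes.endIndex else { return nil }
                index += 1
                count += 1
            case 0x02: // nested document: length byte, then the body
                guard index < bytes.endIndex else { return nil }
                let length = Int(bytes[index])
                index += 1
                guard bytes.endIndex - index >= length else { return nil }
                // Slicing shares storage with the parent; no bytes are copied.
                let nested = ToyDocument(bytes: bytes[index..<index + length])
                guard let nestedCount = nested.valueCount() else { return nil }
                count += nestedCount
                index += length
            default:
                return nil
            }
        }
        return count
    }
}
```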

A big issue with ByteBuffer here is that we have to bundle the whole of NIO, which matters mainly because of binary size. Many people want to use (my) libraries for things like XML, JSON and BSON on other platforms, but don't want to depend on the whole of SwiftNIO as well. This becomes vastly more important when talking to people doing WASM and/or embedded Swift.


I think this mostly boils down to reducing copies/allocations, both within Swift and between kernel- and user-space applications. It also means reducing copies between your apps and the libraries created in both Apple's and Linux's ecosystems (not ignoring the work on Windows either!).

Likewise, we need to consider that, while they're all basically a bag of bytes, the public APIs, (de/re)allocations and various other details can differ greatly. We don't want to import an unnecessarily huge library for this use case, but we do want to use our bytes where we can.

Now I don't think I have a good solution to all these needs, and I can't say I've really tried. But I do know there's a huge demand for this, not just from me but from a large set of Swift developers. I'm happy to put effort into (contributing to) designing various solutions to the problems and use cases listed above, and to those that I'm sure will be added in the comments below. Before any large effort gets kickstarted, I'd love to hear other opinions on the matter.

Finally some thanks to @johannesweiss and @lukasa for their work on SwiftNIO, and a polite request for your experiences and opinions on the matter as well.


I do think the zero-copy space will be massively improved by non-copyable types. Non-copyable types allow for uniquely-owned buffers (rather than Data’s copy-on-write or NSData’s sharing), as well as safely scoped buffers when borrowed (rather than UnsafeRawBufferPointer’s trust that the buffer will stay alive).
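A small sketch of what that could look like, assuming Swift 5.9 or later (where `~Copyable` and the `borrowing`/`consuming` parameter modifiers are available). `UniqueBuffer`, `checksum(of:)`, `transmit(_:)` and `uniqueBufferDemo()` are hypothetical names, not an existing or proposed API.

```swift
/// Hypothetical uniquely-owned buffer: it cannot be copied, so there is
/// exactly one owner and the storage is freed exactly once.
struct UniqueBuffer: ~Copyable {
    private let storage: UnsafeMutableRawBufferPointer

    init(repeating byte: UInt8, count: Int) {
        storage = UnsafeMutableRawBufferPointer.allocate(byteCount: count, alignment: 1)
        _ = storage.initializeMemory(as: UInt8.self, repeating: byte)
    }

    deinit { storage.deallocate() }

    var count: Int { storage.count }

    func byte(at index: Int) -> UInt8 { storage[index] }

    mutating func store(_ byte: UInt8, at index: Int) { storage[index] = byte }
}

/// Borrows the buffer: the caller keeps ownership and the buffer is
/// guaranteed to stay alive for the duration of the call.
func checksum(of buffer: borrowing UniqueBuffer) -> Int {
    var sum = 0
    for index in 0..<buffer.count { sum += Int(buffer.byte(at: index)) }
    return sum
}

/// Consumes the buffer: ownership moves in, and the storage is released
/// when this function is done with it.
func transmit(_ buffer: consuming UniqueBuffer) -> Int {
    return buffer.count
}

func uniqueBufferDemo() -> (checksum: Int, sent: Int) {
    var buffer = UniqueBuffer(repeating: 0, count: 4)
    buffer.store(7, at: 0)
    let sum = checksum(of: buffer) // borrowed, still usable afterwards
    let sent = transmit(buffer)    // consumed, using `buffer` after this line would not compile
    return (sum, sent)
}
```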


one thing i have realized after banging my head against the wall trying to come up with a solution is that we have two contradictory problems here:

  • bag-of-bytes needs to interoperate with the containers we have today:

    • [UInt8]
    • ArraySlice<UInt8>
    • UnsafeRawBufferPointer
    • ByteBufferView
    • Data

    which means it must be generic.
  • code that uses bag-of-bytes needs to be able to get out of the generic world and into the concretely-typed world (via @_specialized, comical overuse of @inlinable, etc.) to achieve reasonable performance.


Historically we've used withContiguousStorageIfAvailable and similar functions as our escape hatch from the underspecialization-vs-overinlining problem, but it's a solution with a number of important downsides.
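For readers who haven't seen the pattern, here's a minimal sketch of that escape hatch. `byteSum` and `sumContiguous` are hypothetical names. `withContiguousStorageIfAvailable` is the standard library method on `Sequence`. Note that it returns `nil` when the sequence has no contiguous storage, so a generic fallback path is always needed.

```swift
/// Concrete, non-generic fast path: the compiler sees exactly one type here.
func sumContiguous(_ bytes: UnsafeRawBufferPointer) -> UInt32 {
    var sum: UInt32 = 0
    for byte in bytes { sum &+= UInt32(byte) }
    return sum
}

/// Generic entry point that accepts any sequence of bytes.
func byteSum<Bytes: Sequence>(_ bytes: Bytes) -> UInt32 where Bytes.Element == UInt8 {
    // Escape hatch: hop from the generic world into the concrete one
    // whenever the sequence can expose contiguous storage.
    let fast = bytes.withContiguousStorageIfAvailable { buffer in
        sumContiguous(UnsafeRawBufferPointer(buffer))
    }
    if let fast = fast { return fast }

    // Fallback for sequences without contiguous storage.
    var sum: UInt32 = 0
    for byte in bytes { sum &+= UInt32(byte) }
    return sum
}
```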


this really doesn't follow at all to my mind. What would a bag of bytes even be generic over? It's just bytes!

we want it to be zero-copy, which means it has to support generic backing storage. otherwise, how would you bridge a ByteBufferView to it, without importing NIOCore?

“Generic backing storage” is just a pointer and an (optional) owning object, what @Michael_Ilseman would call “deconstructed cow”. This does not require generics.
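A rough sketch of that shape, under the assumption that a raw buffer pointer plus an optional `AnyObject` owner is enough. `BorrowedBytes` and `HeapStorage` are hypothetical names for illustration. Any class that keeps the memory alive could play the owner role, and a `nil` owner would mean the caller guarantees the lifetime.

```swift
/// Hypothetical non-generic view: a pointer plus an optional owner
/// that keeps the pointed-to memory alive.
struct BorrowedBytes {
    let pointer: UnsafeRawBufferPointer
    let owner: AnyObject?

    var count: Int { pointer.count }

    subscript(index: Int) -> UInt8 {
        // Keep the owner alive while reading through the raw pointer.
        withExtendedLifetime(owner) { pointer[index] }
    }

    /// Slicing narrows the pointer and shares the same owner; no bytes are copied.
    func slice(_ range: Range<Int>) -> BorrowedBytes {
        BorrowedBytes(pointer: UnsafeRawBufferPointer(rebasing: pointer[range]), owner: owner)
    }
}

/// One possible owner: a class that frees its allocation on deinit.
final class HeapStorage {
    let memory: UnsafeMutableRawBufferPointer

    init(copying source: [UInt8]) {
        memory = UnsafeMutableRawBufferPointer.allocate(byteCount: source.count, alignment: 1)
        memory.copyBytes(from: source)
    }

    deinit { memory.deallocate() }

    var view: BorrowedBytes {
        BorrowedBytes(pointer: UnsafeRawBufferPointer(memory), owner: self)
    }
}
```

Because the owner is type-erased, code that only sees `BorrowedBytes` does not need to import the module that defines the backing storage.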


While I broadly agree with this, I want to throw a curveball here and point to my desire to do slicing. This is important for the parsing approach. Consider the API surface of swift-protobuf. This module defines a protocol that represents a protobuf message (Message), and provides an initializer that takes a binary object to decode the Message.

However, Message types may contain bytes fields. These are binary blob fields. In our ideal world, these fields can be represented with simple slices of the input type, or something broadly equivalent. I have some thoughts on how we might achieve that, but I want to bring it up here as an important use-case.
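To sketch the idea with today's tools (this is not swift-protobuf's API): a made-up wire layout of one id byte, one length byte, then the payload, decoded into a hypothetical `ToyMessage` whose bytes field is a `SubSequence` of whatever input type was handed in.

```swift
/// Hypothetical message whose bytes field is a slice of the input it was decoded from.
struct ToyMessage<Input: RandomAccessCollection> where Input.Element == UInt8 {
    let id: UInt8
    let payload: Input.SubSequence

    /// Assumed toy layout: [id][length][payload bytes...]; trailing bytes are ignored.
    init?(decoding input: Input) {
        guard input.count >= 2 else { return nil }
        let lengthIndex = input.index(after: input.startIndex)
        let payloadStart = input.index(after: lengthIndex)
        let length = Int(input[lengthIndex])
        guard let payloadEnd = input.index(payloadStart, offsetBy: length,
                                           limitedBy: input.endIndex) else {
            return nil // the declared length runs past the end of the input
        }
        id = input[input.startIndex]
        // The field is a view into the input; the payload bytes are not copied.
        payload = input[payloadStart..<payloadEnd]
    }
}
```

For an `[UInt8]` input the payload comes back as an `ArraySlice<UInt8>` that still indexes into the original array. The cost is that the message type becomes generic over its input, which is exactly the tension raised earlier in the thread.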


I concur with you here. Slicing is an important feature in many apps I've worked on.


is it possible to implement such a type using tools that exist today?

this happens in BSON too. but oftentimes the embedded documents are a very small portion of the original byte buffer. that i'm retaining slices instead of copying to dedicated storage (which would allow the message buffer to be released) is more a reflection that my BSON library doesn't have a good way of expressing "i'm ready to copy this thing so it's backed by a [UInt8]"

I am interested in this, too. I believe Swift could excel in this area.

I wrote previously [about "binary interfaces"](https://forums.swift.org/t/binary-interfaces-storage-devices-protocols/60164), which takes in protocol messages and storage formats and other aspects of interacting with bytes that are easy if perilous in C.

Dave


Oops, here is the right link to my post about "binary interfaces".

Dave
