[Discussion] Bag of Bytes Types

I use Data for bags-of-bytes; I’m not entirely sure why I chose it over [UInt8]. Probably

  • I was used to NSData from Obj-C
  • It ‘feels’ more opaque than an array, less focused on the bytes themselves (kind of like RawSpan vs Span<UInt8>)
  • It’s easy to convert to/from String

The fact that it’s not in the core library is somewhat awkward since I’m developing cross platform code, but my code needs FoundationEssentials for other reasons anyway.

FWIW: String(validating:as:) and String(decoding:as:) are the generic works-on-any-collection stdlib versions of String(data:encoding:)
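As a small illustration of the difference between those two stdlib initializers, here is a sketch. It assumes a Swift 6.0 or later toolchain (and a matching OS deployment target on Darwin) for String(validating:as:); the byte values are made up for the example.

```swift
// Sample inputs (hypothetical): one valid UTF-8 sequence, one invalid.
let valid: [UInt8] = [0x68, 0x69]      // "hi"
let invalid: [UInt8] = [0x68, 0xFF]    // 0xFF never appears in valid UTF-8

// Decoding never fails: invalid sequences are repaired with U+FFFD.
let repaired = String(decoding: invalid, as: UTF8.self)

// Validating returns nil instead of repairing.
let strict = String(validating: invalid, as: UTF8.self)
let accepted = String(validating: valid, as: UTF8.self)
```

Both initializers take any sequence of code units, so they work on [UInt8] without going through Data.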

4 Likes

This might not be what you're looking for but I use [UInt8] a lot because it compiles to code that I would expect whereas Data on Darwin has historically generated a massive amount of code for even simple operations like subscripting or getting the count. I have run into several cases where code that uses Data has not been optimized by the compiler whereas the one that uses arrays does exactly what I want, sometimes to the point where I will change all the intermediate types to be arrays and convert at the boundary.
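The "convert at the boundary" pattern might look roughly like the sketch below. The function name and the XOR transform are hypothetical placeholders; the point is only that Data appears in the signature while the work happens on [UInt8].

```swift
import Foundation

// Hypothetical boundary function: Data in the signature, [UInt8] inside.
func obfuscate(_ input: Data, key: UInt8) -> Data {
    var bytes = [UInt8](input)              // one copy on the way in
    for i in bytes.indices {
        bytes[i] ^= key                     // hot loop runs on a plain array
    }
    return Data(bytes)                      // one copy on the way out
}
```

The cost is two copies per call, so this only pays off when the work in between is large enough to matter.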

If you scroll back up to the top, there’s a post describing optimizations to Data in Swift 6.4 that should hopefully eliminate a lot of that overhead.

2 Likes

I just noticed this post yesterday. Having worked with SwiftNIO's ByteBuffer and having tried to optimize mem usage in my packages, I have some opinions I want to share.

First of all, as I see things, the ideal case is that once a process owns "some-bytes" in its memory space, it never copies any part of those bytes again. So we should push for no-copy usage as hard as is practically possible, IMO.

SwiftNIO does a good job at this by introducing readerIndex and writerIndex to the mix, while owning a pointer to the some-bytes. Basically a lowerbound and an upperbound of what the current instance should care about at all, compared to the bytes the ByteBuffer owns.

The good thing about this is that when you only care about a part of this some-bytes, you can just move the reader/writer index without needing to copy bytes around. It's only some integer arithmetic, which is way cheaper than memcpy or such.
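To make the index idea concrete, here is a minimal sketch. ByteCursor and its members are hypothetical names for illustration; this is not SwiftNIO's ByteBuffer API or implementation, and it only models the read side.

```swift
/// Hypothetical, simplified cursor over shared storage.
struct ByteCursor {
    private let storage: [UInt8]            // never mutated, so never copied
    private(set) var readerIndex: Int
    private(set) var writerIndex: Int

    init(_ bytes: [UInt8]) {
        storage = bytes
        readerIndex = 0
        writerIndex = bytes.count
    }

    var readableBytes: Int { writerIndex - readerIndex }

    /// Returns a view of the next `count` bytes and advances the reader index.
    /// The slice shares storage with the array; no bytes are copied.
    mutating func readSlice(count: Int) -> ArraySlice<UInt8>? {
        guard count >= 0, count <= readableBytes else { return nil }
        let slice = storage[readerIndex ..< readerIndex + count]
        readerIndex += count
        return slice
    }

    /// Moves the reader index to an absolute position, e.g. to walk back.
    mutating func moveReaderIndex(to index: Int) {
        precondition(index >= 0 && index <= writerIndex, "index out of range")
        readerIndex = index
    }
}
```

Consuming part of the buffer is a single addition on readerIndex, and walking back is a single assignment.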

You might ask "Why not just keep writerIndex, and always align the base pointer to be the start of the bytes?" To that I'd say it wouldn't be the end of the world, but keeping the readerIndex is still worth it IMO so you can walk back through the bytes if required. For example, one of the things I have experience with is DNS, where the wire format allows receiving a pointer-value to an earlier position in the bytes you've received, which points to a domain name. So when you see you've received such a pointer-value, you need to temporarily walk back through the bytes to find the domain name you're supposed to parse.
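A rough sketch of that walk-back for DNS name compression follows. The function name, the return shape, and the jump limit of 16 are assumptions made for this example; a real parser would need stricter validation.

```swift
/// Reads a DNS name starting at `start`, following compression pointers.
/// Returns the dotted name and the offset just past the name at its original position.
func readDNSName(in message: [UInt8], at start: Int) -> (name: String, next: Int)? {
    var labels: [String] = []
    var offset = start
    var resume: Int? = nil      // where parsing continues after the first jump
    var jumps = 0

    while true {
        guard offset >= 0, offset < message.count else { return nil }
        let length = Int(message[offset])

        if length & 0xC0 == 0xC0 {
            // Compression pointer: the low 14 bits are an earlier offset.
            guard offset + 1 < message.count else { return nil }
            let target = ((length & 0x3F) << 8) | Int(message[offset + 1])
            if resume == nil { resume = offset + 2 }
            jumps += 1
            // Hypothetical guard against pointer loops.
            guard target < offset, jumps <= 16 else { return nil }
            offset = target     // walk back into bytes already received
        } else if length == 0 {
            return (labels.joined(separator: "."), resume ?? (offset + 1))
        } else {
            guard length < 64, offset + 1 + length <= message.count else { return nil }
            let label = message[(offset + 1) ..< (offset + 1 + length)]
            labels.append(String(decoding: label, as: UTF8.self))
            offset += 1 + length
        }
    }
}
```

Note that the earlier bytes have to stay reachable for this to work, which is the argument for keeping a reader index rather than discarding consumed bytes.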

Furthermore, now that Swift has good non-copyable support, I think these kinds of types should default to being non-copyable, with wrapper types that make a reference-counted object out of that non-copyable type or such.
Rust has an Arc type for this (as it does for other basic functionality like CoW, dyn Box, etc.), but this is easier in Swift: any class provides the Box and ARC together. I'd say that class should also come with some CoW already, but I'll have to think a bit deeper to decide for sure and to see whether we should use separate types for CoW, ARC, etc. Possibly we might end up adding such types to the language itself so that not everybody has to introduce them manually in their own library.

So 1 type + wrapper-types. A base non-copyable, minimum-overhead type, that can be wrapped into CoW/ARC etc... on demand.
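As a sketch of that layering, assuming Swift 5.9 or later for ~Copyable: UniqueBytes and SharedBytes are hypothetical names, and the CoW layer is left out.

```swift
/// Hypothetical base type: uniquely owned, cannot be copied, frees itself on deinit.
struct UniqueBytes: ~Copyable {
    private let buffer: UnsafeMutableBufferPointer<UInt8>

    init(repeating byte: UInt8, count: Int) {
        buffer = .allocate(capacity: count)
        buffer.initialize(repeating: byte)
    }

    deinit { buffer.deallocate() }

    var count: Int { buffer.count }

    subscript(index: Int) -> UInt8 {
        get {
            precondition(index >= 0 && index < buffer.count, "index out of range")
            return buffer[index]
        }
        set {
            precondition(index >= 0 && index < buffer.count, "index out of range")
            buffer[index] = newValue
        }
    }
}

/// Hypothetical opt-in wrapper: the class adds shared, reference-counted ownership.
final class SharedBytes {
    let bytes: UniqueBytes
    init(_ bytes: consuming UniqueBytes) { self.bytes = bytes }
}

func makeShared() -> SharedBytes {
    var unique = UniqueBytes(repeating: 0, count: 4)
    unique[0] = 0x2A
    return SharedBytes(unique)      // ownership moves into the wrapper
}
```

Code that never needs sharing uses UniqueBytes directly and pays for no reference counting; only callers that opt into SharedBytes do.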

We should also consider having a stack-or-heap bag-of-bytes type, usually named "Small<bag-of-bytes>" or "Tiny<bag-of-bytes>", for when there can be a small number of bytes (usually 16 or 24), like how String pretty much always keeps up to 15 utf8 bytes inline (except in rare cases, for example when it had to repair a sequence of utf8 bytes).
I have not thought everything through, but I think the Vision should mention how such a type fits into it. It can be pretty helpful if we can't actually reach the optimal point of doing 0 copies. If we do have to make copies, then such a "Small<bag-of-bytes>" can be clutch for performance.
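One way such a type could be shaped is sketched below. SmallBytes, its 16-byte inline capacity, and the two-word packing are assumptions chosen for the example, not a description of how String or Data store small values.

```swift
/// Hypothetical small-buffer type: up to 16 bytes live inside the value itself,
/// anything larger falls back to a heap-allocated array.
struct SmallBytes {
    private enum Storage {
        case inline(low: UInt64, high: UInt64, count: UInt8)
        case heap([UInt8])
    }
    private let storage: Storage

    static let inlineCapacity = 16

    init(_ bytes: [UInt8]) {
        if bytes.count <= Self.inlineCapacity {
            var low: UInt64 = 0
            var high: UInt64 = 0
            for (i, byte) in bytes.enumerated() {
                // Pack byte i into bits 8*i ..< 8*i+8 of the matching word.
                if i < 8 {
                    low |= UInt64(byte) << UInt64(8 * i)
                } else {
                    high |= UInt64(byte) << UInt64(8 * (i - 8))
                }
            }
            storage = .inline(low: low, high: high, count: UInt8(bytes.count))
        } else {
            storage = .heap(bytes)
        }
    }

    var isInline: Bool {
        if case .inline = storage { return true }
        return false
    }

    var count: Int {
        switch storage {
        case .inline(_, _, let count): return Int(count)
        case .heap(let array): return array.count
        }
    }

    subscript(index: Int) -> UInt8 {
        precondition(index >= 0 && index < count, "index out of range")
        switch storage {
        case .inline(let low, let high, _):
            let word = index < 8 ? low : high
            return UInt8(truncatingIfNeeded: word >> UInt64(8 * (index % 8)))
        case .heap(let array):
            return array[index]
        }
    }
}
```

Copying an inline value copies a couple of machine words and touches no allocator, which is where the benefit would come from when copies cannot be avoided.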

The last thing is that we should also consider how such types interact with parsing/serialization. Possibly not with special-case types, but it should be clear what we're aiming for.
Such types should be similar to the new ParserSpan in apple/swift-binary-parsing on GitHub, but with some critical differences.
They should avoid copying memory around when a sequence of the bytes in the bag is needed in the same form. Again, ByteBuffer allows this by moving indexes around. Such a parser type should also allow this.

IMO the best way to implement this might be an enum with one case for the raw non-copyable type and one for a CoW and/or ARC type (or similar).
So in the parsing process, if there is a need to copy bytes, the enum can switch on the fly from something like a Span to a CoW and/or ARC type.

To take it one step further, I hope that someday we can rewrite/modify the internals of some of the common (stdlib) types we currently have so that they share their internal storage as much as possible.

For example, if you've received utf8-bytes over the wire, optimally there should be no need for any copies to make a String out of the bytes.
The way I see this possibly happening is for String to use ARC- and CoW-enabled wrapper types of the incoming bag-of-bytes type. Then when creating a new String from some utf8-bytes in the bag-of-bytes, you only need to tell String to view that part of the bag-of-bytes as its data.

2 Likes

Thanks for the context on your experience. I've definitely encountered this myself. Like @snej pointed out, we've done a lot of work in Swift 6.4 to improve this (a decent amount on Darwin while maintaining ABI stability but significantly more on non-Darwin as well where we do not need to preserve ABI stability). If you have any examples of places in your code where you tried to use Data but couldn't because of the lack of optimization, I'd love to hear if Swift 6.4 has improved this for you. There may still be cases that we haven't seen yet that we can improve!

Thanks for your detailed thoughts, I appreciate it! I definitely agree with a lot of the items you bring up here, from the need for reducing copies via ~Copyable / immutable reference-counted ownership to the Small/Tiny stack-to-heap-spilling type you mentioned. I've often reached for the latter type myself and found Data's implementation of this concept to be lacking when compared to something more customizable. Will be sure to include some of these thoughts in the vision document!

1 Like

Just discovered this thread; thanks for posting it! A significant portion of what I do with Swift tends to involve byte buffers, given that my main app deals with file compression, so this is actually a topic that is very dear to my heart.

What bag of bytes types do you use in your code? Do you use different types for different purposes?

Given that my app is written in Cocoa and has been evolving since well into the Objective-C days, a lot of the older code in the project uses Data. However, I've been gradually breaking off pieces of the code that I think may be generally useful and putting it up on my GitHub account. When I do that, I tend to rewrite it to take some Collection<UInt8> as parameters for external-facing APIs, and when I need to create byte buckets from inside the library code, I usually reach for ContiguousArray<UInt8>.
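The API shape described above could look something like this sketch. The function names and the one-byte length prefix are hypothetical. The fast path uses the standard library's withContiguousStorageIfAvailable, an optional hook on Sequence that returns nil when a type does not expose contiguous storage, so the element-by-element fallback is still needed.

```swift
/// Hypothetical external-facing API: accepts any collection of bytes.
func byteSum(_ bytes: some Collection<UInt8>) -> UInt64 {
    // Fast path: only taken when the concrete type exposes contiguous storage.
    let fast = bytes.withContiguousStorageIfAvailable { buffer in
        buffer.reduce(UInt64(0)) { $0 + UInt64($1) }
    }
    if let fast { return fast }
    // Fallback: generic element-by-element iteration.
    return bytes.reduce(UInt64(0)) { $0 + UInt64($1) }
}

/// Hypothetical internal helper: byte buckets created inside the library
/// use ContiguousArray<UInt8>.
func makeFrame(payload: some Collection<UInt8>) -> ContiguousArray<UInt8> {
    var frame = ContiguousArray<UInt8>()
    frame.reserveCapacity(payload.count + 1)
    frame.append(UInt8(truncatingIfNeeded: payload.count))  // 1-byte length prefix
    frame.append(contentsOf: payload)
    return frame
}
```

Neither function imports Foundation, so clients are free to pass arrays, slices, or their own byte collections.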

Why did you choose to use these types / what aspects led to these decisions?

For my code that uses Data: sometimes this is inertia, and sometimes it's convenience when working with APIs in Foundation or other Apple-supplied frameworks that take Data. In addition, when I am dealing with something that needs to encode and decode strings in various encodings, Data can be very useful for that.

For refactored libraries and green-field code, I tend to like the following things about ContiguousArray<UInt8>:

  • It does not depend on the Foundation framework, which means that libraries that I use it in do not force their clients to link against Foundation, which gives a library broader applicability than it would otherwise have.
  • It is completely disconnected from the Objective-C bridging logic, which makes it simpler for me to reason about what is going on under the hood.
  • It supports all of the functionality I generally need.
  • As you've noted, it often seems to perform better than Data, although from the looks of it, that may be changing.

(Yes, I know that Array<UInt8> would probably work just as well; I just like the definitely-not-ObjC-bridgeable nature of ContiguousArray. I am aware that this may be an emotional decision rather than a practical one.)

In general, what are important characteristics of bag of bytes types for your use cases?

Off the top of my head:

  • The most important thing, to me, is the ability to efficiently get access to an UnsafeBufferPointer (or, these days, a Span) so that I can process the contents as performantly as possible.
  • When interacting with C APIs, I like having a way to initialize my byte bag type directly from uninitialized memory, without everything being initialized to zero first, for performance reasons. The init(unsafeUninitializedCapacity:initializingWith:) on Array and ContiguousArray is great for this. Data does not have it.
  • Sometimes, a C API will give you a byte blob that you have to manually free later. In that case, it's nice to be able to build a byte bag that wraps it, and frees it whenever the byte bag is disposed of. Data's init(bytesNoCopy:count:deallocator:) is great for this, and this time, it's Array and ContiguousArray that don't have an equivalent feature. This can also be used to accomplish the previous bullet point with Data, although it's a bit more boilerplate.
  • What no built-in type supports, as far as I'm aware, and what would be totally great, is the ability to load and store values of various types, like UnsafeMutableRawBufferPointer does, but with the ability to specify endianness. That way you could store, say, a big-endian UInt32 at some offset in your data, and when running on a little-endian ARM64 machine, it would byte-swap it for you. Currently I'm using my own homegrown wrapper types for this.
  • It's worth noting that there's no specific reason that Data itself, or any other specific type for that matter, needs to support all these features by itself. If we had a good protocol, for example, the init(bytesNoCopy:) feature could just be a separate type that wraps a pointer, rather than Data itself having to do it. However, that brings me to:
  • The biggest annoyance that I have in this area, bar none, is the lack of good protocols for data types. While every built-in type that conforms to Collection<UInt8> supports getting byte buffers and/or spans, the Collection protocol itself does not, and the DataProtocol and ContiguousBytes protocols are specific to Foundation and cannot be used in libraries that don't pull it in. This means that to have a non-Foundation-requiring library, you either have to do a bunch of dynamic type checks or create your own custom protocol. Even those tricks won't work for Data itself, since you'd have to pull in Foundation to be aware of either the Data type or the DataProtocol that supports it in order to check for them dynamically. So one has to either do a byte copy into a ContiguousArray or resign oneself to the slow byte-by-byte Collection access. For this reason, the biggest win for me would be to get either DataProtocol or some other "I can make an UnsafeBufferPointer and/or Span from this" protocol into the standard library instead of only inside the Foundation walled garden.
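Two of the items above can be sketched in a few lines: filling an array without a zeroing pass, and endian-aware load/store. makeBytes, storeBigEndian, and loadBigEndian are hypothetical helpers written for this example, not existing standard library API; initializeElement(at:to:) assumes Swift 5.8 or later.

```swift
/// Fills a new array directly, with no zero-initialization pass first.
/// A real caller would typically hand `buffer.baseAddress` to a C function here.
func makeBytes(count: Int) -> [UInt8] {
    [UInt8](unsafeUninitializedCapacity: count) { buffer, initializedCount in
        for i in 0..<count {
            buffer.initializeElement(at: i, to: UInt8(truncatingIfNeeded: i))
        }
        initializedCount = count
    }
}

extension Array where Element == UInt8 {
    /// Hypothetical helper: stores `value` at `offset`, most significant byte first,
    /// regardless of the host's byte order.
    mutating func storeBigEndian<T: FixedWidthInteger>(_ value: T, at offset: Int) {
        let size = T.bitWidth / 8
        precondition(offset >= 0 && offset + size <= count, "out of range")
        for i in 0..<size {
            let shift = 8 * (size - 1 - i)
            self[offset + i] = UInt8(truncatingIfNeeded: value >> shift)
        }
    }

    /// Hypothetical helper: loads a big-endian `T` starting at `offset`.
    func loadBigEndian<T: FixedWidthInteger>(_ type: T.Type, at offset: Int) -> T {
        let size = T.bitWidth / 8
        precondition(offset >= 0 && offset + size <= count, "out of range")
        var result: T = 0
        for i in 0..<size {
            result = (result << 8) | T(truncatingIfNeeded: self[offset + i])
        }
        return result
    }
}
```

Because the helpers work byte by byte with shifts, they behave the same on little-endian and big-endian hosts and need no alignment guarantees.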
3 Likes

Kudos for your work @jmschonfeld, the benchmarks look impressive. I haven’t really explored the benefits of choosing either type in particular, but maybe you guys can help me squeeze some more performance out of my use case.

I maintain a cross-platform VPN app that therefore does a lot of boring crypto and data copy, with a significant amount of C code involved because the ergonomics always felt more natural than Swift to me for all the raw back and forth.

Looking at my own Swift code, I see a mix of Data and [UInt8] for no particular reason other than having written the code at different times.

My questions in light of this topic:

  • What’s the preferred type for handling a high volume of data packets with in-place encryption and decryption?
  • What’s the preferred type to avoid continuous bridging/copy between Swift and C?
  • Does Swift support a sort of “arena allocator” in that I give the process say 10MB and as long as I’m within the range it never calls malloc() a single time?

Overall, I wonder if there’s a strategy or a bag type in particular to minimize or even avoid Swift buffer allocations because the way I use memory is very predictable.

Thank you!

1 Like