[Discussion] Bag of Bytes Types

Hi everyone,

For Swift 6.4, I’ve spent a fair amount of time working on bag of bytes types (including Data in particular). Based on community feedback and some performance investigations, Data has now gained some fairly significant performance improvements, including:

  • Eliminating unnecessary exclusivity checking (which previously incurred significant runtime costs)

  • Adopting an entirely new ABI on platforms without ABI stability (such as Linux, Windows, Android, and WASM) that drastically reduces client code size and improves throughput performance of common operations

  • Improving specializations and fast paths in common operations to reduce runtime overhead and eliminate some common pitfalls

These performance improvements have led to some significant wins in benchmarking:

  • Data.bytes is 787% faster and produces significantly smaller client binaries with the new ABI, and 147% faster with the existing ABI

  • Data.count is 363% faster with the new ABI

  • Data.== is 74% faster with one-third of the client binary size with the new ABI, and 47% faster with the existing ABI

  • Appending a byte to a Data is 34% faster with the new ABI

  • Iterating a Data is up to 24% faster, and produces significantly smaller client binaries with the new ABI

In common cases, I’ve found that the throughput performance of Data is now comparable to the performance of Array<UInt8> for equivalent operations when using the new ABI. I’d encourage you all to give it a try and let me know how it impacts your apps!

Improving the performance of Data was a key first step, but we still have more work to do (especially with the introduction of new non-copyable and non-escapable types and language features). As these features develop, I’m working on putting together a vision document to guide the direction for bag of bytes types in Swift. I plan to pitch this vision document as a tool to help us establish concrete recommendations for developers and craft future evolution proposals in this space.

As part of this process, I’d first like to kick off a discussion around your existing use of bag of bytes types in your projects to gather information about experiences today. This is not a pitch for new API or an evolution review yet, but rather a request for information on how you use bag of bytes types and what you feel are important aspects of these types. In particular, I’m interested in hearing from you on:

  • What bag of bytes types do you use in your code? Do you use different types for different purposes?

    • Examples include Data, Array<UInt8>, Unsafe(Mutable)(Raw)BufferPointer, ByteBuffer, DispatchData, span types, etc.

  • Why did you choose to use these types / what aspects led to these decisions?

    • e.g. providing module, API surface/functionality, behavioral guarantees, performance characteristics, etc.

  • In general, what are important characteristics of bag of bytes types for your use cases?

    • e.g. alignment guarantees, allocation or lifetime guarantees, categories of APIs available, interoperability requirements with other languages or SDK/toolchain APIs, etc.

  • What does the flow of your bytes look like through your application or API?

    • What APIs do you use to receive bytes, what APIs do you provide bytes to, what does the lifetime of your bytes look like?

Thanks in advance for sharing your experience. I hope we can identify some key themes that I can incorporate into a future evolution discussion around a long-term vision for bytes.


This sounds awesome!

I used Data frequently when writing swift-libgit2, a memory-safe Swift wrapper around the full libgit2 C API. There's a lot of Swift-to-C and C-to-Swift conversion to keep function signatures and struct shapes the same as their C equivalents.

libgit2 has a lot of char * types. Where the bytes are textual and known-terminated I use String, and otherwise I generally use Data, specifically when the bytes might be binary, non-UTF-8, or unterminated. Most outbound calls go through Data helpers like:

internal extension Data
{
    func withCString<T>(
        _ body: (UnsafePointer<CChar>, Int) throws -> T
    ) throws -> T
    {
        return try self.withUnsafeBytes
        {
            bytes in
            
            guard let baseAddress: UnsafeRawPointer = bytes.baseAddress
            else
            {
                throw NSError.makeCConversionError()
            }
            
            return try body(
                baseAddress.assumingMemoryBound(to: CChar.self),
                bytes.count
            )
        }
    }
}

And variants like withMutableCString(_:), withMutatingCString(_:), and withOptionalCString(_:).

I had to be careful about what empty Data meant, since it has semantic meaning other than an error. For example, empty binary diff data may represent no changes, and empty credential data may be valid. In these cases, callers of the internal helpers handle the empty data by passing nil for the C string and 0 for its count. In other cases, like user-provided input, empty Data is invalid and a thrown error is more appropriate.
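As an illustration of the "nil pointer and 0 count for empty data" convention described above, here is a minimal sketch. The name withOptionalCString(_:) appears in the post, but this body is only a guess at what such a helper might look like and is not taken from swift-libgit2.

```swift
import Foundation

extension Data {
    /// Hypothetical helper: calls `body` with (nil, 0) when the data is empty,
    /// and with a pointer to the bytes plus their count otherwise.
    func withOptionalCString<T>(
        _ body: (UnsafePointer<CChar>?, Int) throws -> T
    ) rethrows -> T {
        if isEmpty {
            // Empty data is meaningful here, so it is passed through as nil/0
            // instead of being treated as an error.
            return try body(nil, 0)
        }
        return try withUnsafeBytes { bytes in
            try body(
                bytes.baseAddress?.assumingMemoryBound(to: CChar.self),
                bytes.count
            )
        }
    }
}

// Usage: the closure stands in for a C call that accepts (const char *, size_t).
let emptyWasNil = Data().withOptionalCString { pointer, count in
    pointer == nil && count == 0
}
let payloadCount = Data([0x00, 0xFF, 0x41]).withOptionalCString { pointer, count in
    pointer == nil ? -1 : count
}
print(emptyWasNil, payloadCount)
```

Callers for which empty input is invalid would check isEmpty and throw before reaching a helper like this.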

Bytes cross the interop boundary in both directions:

// Data in to libgit2
public func gitDiffFromBuffer(
    out         : UnsafeMutablePointer<OpaquePointer?>,
    content     : Data,
    contentLen  : Int // Unused; mirror libgit2 C signature.
) -> GitErrorCode
{
    return withCConversion
    {
        return try content.withCString
        {
            cContent, cContentCount in
            
            return git_diff_from_buffer(
                out,
                cContent,
                cContentCount
            )
        }
    }
}

// Data out from libgit2
public func gitDiffToBuf(
    out     : inout Data,
    diff    : OpaquePointer,
    format  : GitDiffFormatT
) -> GitErrorCode
{
    return withCConversion
    {
        return try out.withMutatingGitBuf
        {
            cOut in
            
            return git_diff_to_buf(
                cOut,
                diff,
                format.cValue()
            )
        }
    }
}

I don't actually have a feature request, since Data has worked very well for me. This point did remind me about Span/RawSpan, which came out while I was working on swift-libgit2. I haven't migrated to using them yet, but all of my with* accessors are hand-rolled scoped borrows, so they could be useful there.

Looking forward to reading the vision.


I rarely use bag of bytes types. Sometimes I use Data for short-lived data, usually containing UTF-8-encoded strings, but I once had a different use case that I had to fight with:

One of my projects needed a buffer whose pointer was passed to an external library written in C. The buffer had to be allocated once and live for the entire run time of the app, and the pointer must not change; that is, the buffer must not be implicitly copied or passed by value. I don't remember exactly which data type I unsuccessfully tried to use (Data? Likely a struct), and I ended up using malloc and free.

It would be great to use a Swift data type in this case, not C. Maybe modern Swift already has a solution.


Thanks for sharing the examples in your project! I'll definitely have to take a look at these to see some good examples of these use cases in action.

String is a good point here; I often just think of types like Array/Data when thinking of "bags of bytes", but String is also related here for bags of bytes that are known valid UTF-8. Did you choose to use a String instead of a Data there because of additional APIs available on String that you might need, or just to help preserve the invariant of known valid UTF-8? Also, when you chose Data for other bytes did you debate between any other types like Array, etc.? Curious to hear if you had any explicit reasons to opt for Data over the others, or if it was just a natural fit with no reason to investigate the alternatives.

Yeah it seems like spans might be very useful to you here to help reduce the closure nesting (and the use of unsafe pointers if the C APIs are annotated in a way that allows the clang importer to import them with span parameters). If you end up attempting any migration there I'd love to hear about your experience since I think borrowed bytes are also very important to the story here in addition to owned bytes!


To clarify, do you mean your Swift code created this buffer once and passed it to C APIs over the lifetime of the app, or that a C library created the buffer and handed it off to Swift code which used it for the lifetime of the app?

Avoiding implicit copies is a feature gaining more traction called non-copyable (~Copyable) types. I'd be curious to know if you had a bag of bytes that was ~Copyable whether that would work for your use case, and if so what other characteristics you'd want (mutability? resizing? custom allocation of the buffer?)
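To make the question concrete, here is a minimal sketch of what a ~Copyable bag of bytes with a stable address could look like. PinnedBuffer and all of its members are hypothetical names invented for this illustration, not an existing library type; it assumes Swift 5.9 or later for noncopyable structs.

```swift
/// Hypothetical owning buffer: allocated once, never implicitly copied,
/// and freed when the owning value is destroyed.
struct PinnedBuffer: ~Copyable {
    let storage: UnsafeMutableRawBufferPointer

    init(byteCount: Int, alignment: Int = 16) {
        precondition(byteCount > 0, "an empty allocation has no useful address")
        storage = .allocate(byteCount: byteCount, alignment: alignment)
        // Zero-fill so reads before the first write are well defined.
        storage.initializeMemory(as: UInt8.self, repeating: 0)
    }

    /// The address that would be handed to the C library. It does not change
    /// for the lifetime of this value because the allocation is never moved.
    var baseAddress: UnsafeMutableRawPointer { storage.baseAddress! }

    var count: Int { storage.count }

    /// In-place mutation of a single byte.
    subscript(offset: Int) -> UInt8 {
        get { storage[offset] }
        nonmutating set { storage[offset] = newValue }
    }

    deinit {
        storage.deallocate()
    }
}

// Usage: the pointer is identical before and after a mutation.
func pinnedBufferKeepsItsAddress() -> Bool {
    let buffer = PinnedBuffer(byteCount: 64)
    let before = buffer.baseAddress
    buffer[0] = 0x2A
    return before == buffer.baseAddress && buffer[0] == 0x2A
}
print(pinnedBufferKeepsItsAddress())
```

Because the struct is ~Copyable, passing it around moves or borrows it, and the compiler rejects code that would need an implicit copy. Resizing is deliberately left out, since growing the allocation would change the address.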

When I was last noodling around with some protocol parsing stuff, NIO's ByteBuffer was very convenient for doing protocol packing-unpacking. Having some of that API surface higher up in the chain might be very convenient.

I would love to be able to just use one type, like ContiguousArray<UInt8>, but I often find myself converting between multiple different types based on what APIs require. e.g. I'll read an image using some Foundation API that returns Data, then convert it to [UInt8] to pass it off to a decompression API, then pass that into various platform-specific rendering APIs that may use [Int8] or Data or UnsafePointer<UInt8>! or...


My Swift code created this buffer once and passed it to C APIs. The buffer contained a C struct. I was not able to use Swift struct and used malloc, free, and unsafeBitCast.

Most likely a ~Copyable struct was what I needed. The project was written before ~Copyable existed.

In-place mutability.

Oops... It was not a Bag of Bytes, it was a struct... I turned it into a Bag of Bytes only because I was unable to use a struct.

But I can imagine similar use cases involving a Bag of Bytes.


Yeah, this is definitely a sore spot for some APIs and is something that I'm hoping having an aligned long term vision for bytes can help alleviate over time. To that end, if you have any insight on your experience with question/bullet 3 about what characteristics are important to you based on your experience with each of these types, I would love to hear it. I think those characteristics impact what bag of bytes types are selected and perhaps differing required characteristics may lead to some of these splits in preferred types.

Thanks for sharing! I think that's definitely a key characteristic that some people are reaching for. By being convenient for protocol packing-unpacking, do you mean that it's important to you that bag of bytes types have APIs like the various readX functions to read values of differing types from the buffer, or are you referring to other APIs?
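For readers unfamiliar with the readX style of API, here is a minimal sketch of the idea over a plain [UInt8]. ByteReader and its members are hypothetical names used only for illustration; this is not NIO's ByteBuffer, just a cursor that advances as values are read.

```swift
/// Hypothetical cursor-style reader over a byte array.
struct ByteReader {
    private let bytes: [UInt8]
    private(set) var readerIndex = 0

    init(_ bytes: [UInt8]) { self.bytes = bytes }

    /// Number of bytes that have not been consumed yet.
    var readableBytes: Int { bytes.count - readerIndex }

    /// Reads a big-endian fixed-width integer and advances the cursor.
    /// Returns nil, without advancing, if too few bytes remain.
    mutating func readInteger<T: FixedWidthInteger>(as type: T.Type = T.self) -> T? {
        let size = MemoryLayout<T>.size
        guard readableBytes >= size else { return nil }
        var value: T = 0
        for byte in bytes[readerIndex ..< readerIndex + size] {
            value = (value << 8) | T(truncatingIfNeeded: byte)
        }
        readerIndex += size
        return value
    }
}

// Usage: unpack a 2-byte length, a 4-byte magic number, and a 1-byte flag.
var reader = ByteReader([0x00, 0x2A, 0xDE, 0xAD, 0xBE, 0xEF, 0x01])
let length: UInt16? = reader.readInteger()
let magic: UInt32? = reader.readInteger()
let flag: UInt8? = reader.readInteger()
let pastTheEnd: UInt16? = reader.readInteger()
print(length as Any, magic as Any, flag as Any, pastTheEnd as Any)
```

The convenience comes from the cursor: each call picks up where the previous one stopped, so the parsing code reads like the wire format.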

While developing the Encoder types for both swift-foundation's JSON/Plist encoders and the new-codable design, I've naturally reached for various bag of byte types to write out the encoded bytes. I've used Data, [UInt8], and even custom non-copyable types.

Choosing between these types is typically purely driven by performance benchmark profile observations. Data was the natural thing to reach for, but before your performance improvements, it could have high overhead, especially for the use case of encoding. [UInt8] generally was able to perform better, especially in optimized code.

With the advancements in move-only types came the opportunity to more reliably eliminate ARC activity on this uniquely-held bag of bytes (at least until the encoding completed). I tried UniqueArray<UInt8> and definitely saw a performance boost over both Data and [UInt8]. Surprisingly though, I saw a modest (~10%) increase in footprint moving to a more focused raw "bag of bytes" type over one that had to be explicitly parameterized to hold UInt8s (like UniqueArray).

I'm personally looking forward to a possibility of being able to use a library type for this rather than maintaining my own. :grinning_face:

For encoding formats like XML plist and especially JSON, the ability to efficiently append both moderately sized pieces (e.g. user strings) and very small ones (single delimiter characters) is extremely important. Bonus points if the type is flexible enough (either through API or optimizer tricks) to recognize that I'm about to append multiple bits in a row (e.g. :, ", user string, ",,) and pre-size itself and elide redundant length-vs-capacity checks. These checks, apart from the actual growing of the array, end up being the majority of the overhead.
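A minimal sketch of the manual version of that pre-sizing, using [UInt8] and reserveCapacity. The function appendPair is a hypothetical helper invented for this example and does no JSON escaping. It only guarantees a single growth step for the whole group of appends; each individual append still performs its own capacity check, which is the overhead described above that a dedicated API or the optimizer would have to remove.

```swift
/// Hypothetical helper: appends `"key":"value",` to `buffer` as UTF-8.
/// No escaping is performed; this only illustrates pre-sizing.
func appendPair(key: String, value: String, to buffer: inout [UInt8]) {
    let quote = UInt8(ascii: "\"")
    let keyBytes = key.utf8
    let valueBytes = value.utf8

    // Four quotes, one colon, and one comma make six delimiter bytes.
    let needed = buffer.count + keyBytes.count + valueBytes.count + 6
    if needed > buffer.capacity {
        // Grow at least geometrically so repeated calls stay amortized O(1).
        buffer.reserveCapacity(max(needed, buffer.capacity * 2))
    }

    buffer.append(quote)
    buffer.append(contentsOf: keyBytes)
    buffer.append(quote)
    buffer.append(UInt8(ascii: ":"))
    buffer.append(quote)
    buffer.append(contentsOf: valueBytes)
    buffer.append(quote)
    buffer.append(UInt8(ascii: ","))
}

// Usage
var encoded: [UInt8] = []
appendPair(key: "name", value: "swift", to: &encoded)
print(String(decoding: encoded, as: UTF8.self))
```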

It's pretty straightforward in my case. The encoder drives a "writer" type that owns the buffer, either as encoding happens or all in one go after encoding. The bytes are written incrementally and are eventually given to the caller to own. The "writer" is destroyed and so no longer needs to reference the bytes after the encoding is complete.
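That ownership flow can be sketched with a noncopyable writer whose final method consumes it. ByteWriter and its methods are hypothetical names for illustration, not the types used in swift-foundation; the sketch assumes Swift 5.9 or later.

```swift
/// Hypothetical writer that uniquely owns its buffer until encoding finishes.
struct ByteWriter: ~Copyable {
    private var bytes: [UInt8] = []

    init(estimatedSize: Int = 0) {
        bytes.reserveCapacity(estimatedSize)
    }

    /// Appends a single delimiter-sized byte.
    mutating func write(_ byte: UInt8) {
        bytes.append(byte)
    }

    /// Appends the UTF-8 bytes of a string.
    mutating func write(_ string: String) {
        bytes.append(contentsOf: string.utf8)
    }

    /// Ends the writer's life and hands the accumulated bytes to the caller.
    consuming func finish() -> [UInt8] {
        return bytes
    }
}

// Usage: bytes are written incrementally, then ownership moves to the caller.
func encodeGreeting() -> [UInt8] {
    var writer = ByteWriter(estimatedSize: 16)
    writer.write(UInt8(ascii: "{"))
    writer.write("\"hi\":1")
    writer.write(UInt8(ascii: "}"))
    return writer.finish()  // `writer` cannot be used after this point.
}
print(String(decoding: encodeGreeting(), as: UTF8.self))
```

Because finish() is consuming, the compiler rejects any later use of the writer, which matches the "writer is destroyed after encoding" lifetime described above.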

Bags of bytes are also used in decoding, but there isn't much interesting stuff going on there. We just need random access to one or more bytes at a time. The new-codable work is leaning heavily into RawSpan and its ability to temporarily vend portions of the input directly to the client to avoid copying. The decoders there don't even own the input bytes at any point, unlike the encoders.


Thanks, glad they're useful!

Mostly the latter reason: preserving the known-valid UTF-8 invariant (terminated, textual, non-binary). I have git_buf converters on both String and Data and pick between them based on the underlying libgit2 API. For example, git_branch_upstream_name() writes the upstream branch name into a git_buf, and my wrapper gitBranchUpstreamName() takes an inout String? for out, since a branch name has no reason to be Data. I did consider using Data everywhere for consistency, but went with the more semantically-correct option wherever a plain String fits. The type itself is basically a hint about what's expected and what comes back. It was less about available APIs, since I hand-rolled most of the with* methods, though String.withCString(_:) was foundational underneath many of them.

Not really; it honestly never occurred to me to investigate alternatives where Data was appropriate. For example, I could have used [UInt8] instead of Data, but Data always felt like the more idiomatic and semantically-appropriate choice to represent "some bytes". I did use arrays frequently in the "list of things" cases: gitGraphReachableFromAny() accepts [GitOID], GitCheckoutOptions.paths is [String], each backed by a matching Array helper ([GitOID].withArrayOfGitOIDs(_:), [String].withArrayOfCStrings(_:)) for the conversion.

For sure! The closure nesting is real, since some libgit2 APIs like git_diff_buffers() take 12 parameters and I sometimes have to convert many of them using closures, but luckily the logic is all abstracted away. Unfortunately, from what I remember, I don't think most libgit2 headers are annotated to allow the parameters to be imported as Span, so I'd probably have to adopt it on the Swift side of the API. Still worth trying, and I'll definitely check in about it if I get it done!


Right. This sort of thing is useful for exploring packed binary data like file formats or network protocols (which is presumably why it's a key convenience for NIO).


For the last part of my post, at least, it is just a difference in the underlying platform APIs. For example, on iOS I'm constructing a CIImage, which requires Data. On Android, Java doesn't have raw pointers or unsigned integers, so [Int8] is required when passing a byte buffer to a Java API (especially one I don't have any control over). I have rarely, if ever, gone out of my way to change buffer types to make them easier to work with; usually I convert because I have no choice. To that end, if it's somehow possible to avoid copying the buffer when going between Data, [Int8], and [UInt8], that would be ideal.
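For reference, a small sketch of those conversions as they can be written today. The owned-to-owned conversions shown here each copy the bytes; the only copy-free option in this sketch is a scoped reinterpretation inside a closure, which does not produce an owned [Int8]. The sample values are arbitrary.

```swift
import Foundation

let unsignedBytes: [UInt8] = [0x00, 0x7F, 0x80, 0xFF]

// [UInt8] -> Data and back: each direction copies the bytes.
let data = Data(unsignedBytes)
let roundTripped = [UInt8](data)

// [UInt8] -> [Int8]: same bit patterns reinterpreted as signed, also a copy.
let signedBytes = unsignedBytes.map { Int8(bitPattern: $0) }

// Scoped view of the same storage as Int8, without copying. The rebound
// buffer is only valid inside the closure.
let signedSum: Int = unsignedBytes.withUnsafeBufferPointer { unsignedBuffer in
    unsignedBuffer.withMemoryRebound(to: Int8.self) { signedBuffer in
        signedBuffer.reduce(0) { $0 + Int($1) }
    }
}
print(roundTripped, signedBytes, signedSum)
```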


Excellent work!

What bag of bytes types do you use in your code? Do you use different types for different purposes?

NIOCore.ByteBuffer most of the time, especially for networking code. But I have been leaning more into Span recently when I only need parsing. For serialization, I tend to expose [UInt8] or ByteBuffer.

Why did you choose to use these types / what aspects led to these decisions?

  • NIOCore.ByteBuffer is the canonical type for Swift-NIO, which powers most of my code.
  • Span is more generic and amazing for parsing, but OutputSpan is unwieldy (lack of realloc) so I avoid it.
  • Data is heavy, and incurs the binary size cost of Foundation, not just Data but also other Foundation types.
  • [UInt8] is cool technically, but I need an extra copy in almost all cases.

In general, what are important characteristics of bag of bytes types for your use cases?

  • Avoid copies, realloc-able when serializing.

When first moving my macOS and iOS projects to Swift in 2014, I found that NSKeyed[Un]Archiver was plagued with serious bugs. As all my applications in the field stored data via that, I needed to find a better solution.

That led me to my own binary format, where I leverage Data. Legacy data was then converted to the new formats. Thankfully, read-only logic with NSKeyedUnarchiver was stable enough for my needs. And many years later, all that legacy code was removed.

I liked the Data API at the time (still do). Any performance gains with it would be great!

To dig into this a little bit, could you elaborate a bit more on the "Data is heavy" aspect? Not that I fully disagree, just trying to gain more insight and am curious if this relates to its ABI/codegen, its API surface, etc. Also curious if this is mostly about its location in Foundation (i.e. this would not be a concern if the type were in the stdlib or a standalone package) or if you view Data itself as too heavy for your use case.

I assume this is because many APIs don't accept a [UInt8] so to do anything with it you need to copy it to a Data/ByteBuffer for interaction with other APIs?

If it was in stdlib, that would change things. Right now, I don’t see it as a better API than [UInt8]. Better performance merely puts it at the same level, in a more difficult module.

API-wise, Span is just more compatible for parsing. And I see Array as a more successful type than Data, or at least I don’t see benefits in favor of Data.

ByteBuffer does actually have merits in terms of both API design and the separate reader and writer indices.

Right now, the only reason many of my libs import Foundation is the Date type - and I’d love not to do that too.


I mostly use Span for decoding. That way, I can start with a Data, [UInt8], or any other bag of bytes type and reuse the same decoding code without hoping for specialization. Span also offers unchecked subscripting for cases where I know that the index is valid and so can save on bounds checking, something that is missing from many bag of bytes types.
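A minimal sketch of that pattern, assuming a Swift 6.2 or later toolchain that provides Span and the span property on the standard library array types (on Apple platforms, availability annotations may be needed depending on the deployment target). The checksum function is a hypothetical stand-in for real decoding logic.

```swift
/// Sums every byte with wrapping addition. The function takes a borrowed
/// Span, so it is not generic and does not depend on which container owns
/// the storage.
func checksum(_ bytes: Span<UInt8>) -> UInt8 {
    var sum: UInt8 = 0
    for index in bytes.indices {
        sum &+= bytes[index]
    }
    return sum
}

// Usage: the same non-generic function serves two different owning types.
func checksumsFromDifferentContainers() -> (array: UInt8, contiguous: UInt8) {
    let array: [UInt8] = [0x10, 0x20, 0x30]
    let contiguous: ContiguousArray<UInt8> = [0x10, 0x20, 0x30]
    return (checksum(array.span), checksum(contiguous.span))
}
let sums = checksumsFromDifferentContainers()
print(sums.array, sums.contiguous)
```

The subscript used here is bounds-checked; the unchecked subscripting mentioned above is a separate, unsafe entry point and is not shown.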

Data is useful to read bytes from a file. Most of the time, I will create a Span and then work with that.

For encoding, I use UniqueArray<UInt8>. It prevents me from making accidental copies, eliminates exclusivity checks, and has no copy-on-write overhead. If I need to store the encoded data in-memory, I will create a [UInt8] as that is the most universal type for bytes. Encoded data often gets written to a file, so I can skip the conversion to another type in most cases.