Idea: Bytes Literal

duan · January 24, 2021, 9:01pm

Please consider this a "pre-pitch" (aka Daniel's longer-than-a-tweet).

Summary

I think Swift should have "bytes literals", supported by the compiler, as well as the standard library. Both 'hello' and '\x68\x65\x6C\x6C\x6F' would be valid bytes literals, representing the same value.
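For comparison, here is how the same five bytes can be spelled in today's Swift, without the proposed single-quote syntax. This is only a sketch of the status quo; the names fromText and fromHex are illustrative.

```swift
// The UTF-8 view of an ASCII-only string yields one byte per character.
let fromText = Array("hello".utf8)

// The same bytes written out as hexadecimal integer literals.
let fromHex: [UInt8] = [0x68, 0x65, 0x6C, 0x6C, 0x6F]

// Both spellings produce the same [UInt8] value.
print(fromText == fromHex)
```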

What is a byte literal?

They are similar to literals for strings/integers/bools. The standard library would provide similar treatment as for other literals. Specifically:

Byte - Signless 8-bit value, not an int but has bit-twiddling facilities
Bytes - Collection of bytes with contiguous storage: random access, range replaceable, etc.
ExpressibleByBytesLiteral, BytesLiteralType

Why?

Swift should have first-class support for working with binary data. One could argue [UInt8] works just fine. But the signedness of 8-bit ints leaves a lot to be desired. Also, Array might not be the right low-level "bag-o-bits" container.

Bytes literals let us embed binary data in programs. One of the cool features I want Swift to have in the future is something like Rust's include_bytes, which lets you specify a file path and has the compiler expand the source to include the file's content (like an image!). First-class language support makes this possible (I think embedding an array literal that represents [UInt8] would be...fine, if we don't get first-class bytes). In general, Swift users use Foundation.Data a lot. With Bytes, the stdlib would have low-level support for this need. Having this collection of bytes in the standard library helps with things like SR-920 as well: Bytes could back the improved IntegerLiteralType in the future; it could even be the public interface for literals of arbitrary precision. Signless bytes might even help represent "C strings", if we were to be bold.

Conclusion

A lot of stuff here. I want to gauge interest in this direction from the community with this post before committing to a pitch. Would Swift having these features help you? Is [UInt8] good enough? Counterpoints?

Related further reading

SE-243
SR-920
PEP-3112: Python's bytes literal
include_bytes: Rust's cool stonk

Max_Desiatov · January 24, 2021, 9:15pm

Yes please!
My own use case for this is UTF-16 strings in WebAssembly. Parts of them are statically known at compile time: those are HTML element names, i.e. <div>, <input> etc. Other parts, the actual element content to be rendered, come from the browser's JavaScript environment, which uses UTF-16 instead of UTF-8. In the end I only need to concatenate them all and feed the result back to the JavaScript environment.

String is just too heavy for this, especially with UTF-16 <-> UTF-8 re-encoding back and forth. I'd rather operate on this as a byte string, but retain the ability to conveniently specify parts of that final rendered string at compile-time as UTF-16 bytes. I think what's described here would fit well, if only there's a way to specify how exactly byte literals are encoded?
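To make the cost concrete, this is roughly what producing UTF-16 bytes for a statically known tag looks like today. It is a sketch only: the little-endian byte order is an assumption about what the consumer expects, and the conversion runs at run time rather than being baked in at compile time.

```swift
// UTF-16 code units for a statically known tag name.
let units = Array("<div>".utf16)

// Flatten each 16-bit unit into two bytes, low byte first (assumed little-endian).
let utf16Bytes: [UInt8] = units.flatMap { unit in
    [UInt8(truncatingIfNeeded: unit), UInt8(truncatingIfNeeded: unit >> 8)]
}
```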

Something like include_bytes or some other way to specify byte literals would be very handy. Currently available array literals just aren't good enough for that.

Lantua · January 24, 2021, 9:33pm

Instead of an entirely new Literal stack and syntaxes, maybe we can add only StaticBytes as part of StringLiteral, which can be used by ExpressibleByStringLiteral.

SDGGiesbrecht · January 24, 2021, 9:39pm

There is a lot of history here. I recommend you start by looking at the roadmap the core team provided in SE‐0243’s rejection.

duan · January 24, 2021, 9:55pm

I’m aware of the history. There’s a link to the proposal in my post.

xwu · January 24, 2021, 10:12pm

I'm wary of a proposal that attempts to tackle the breadth of use cases you cite here:

supplanting Foundation.Data as a currency type
embedding binary data into source
changing the internal backing for integer literals, future public interfaces
C strings

...and the breadth of types that you'd propose to add:

another representation of a byte (see discussion below)
a collection type
a literal type
a literal protocol
a new literal syntax

I agree that there is something to be desired that the bag-of-bytes currency type exists in Foundation, and certainly there are reasons to consider something lower level.

But, as you say, [UInt8] is a (very optimized) contiguously stored collection type, and UInt8 (as with all other integer types) purposely represents both integer values and the sequence of bits that make up their binary representation. You conclude that not making those changes would be "fine," while some of these other use cases are listed as hypotheticals as well. It's not clear to me what the motivating use case(s) ultimately is/are which byte literals are purported to address.

So my advice to you would be to approach this from the perspective of specific pain points you've encountered that you want to solve. What's the minimum required to actually address those things? I think that would make for a more tractable approach than fixing as your goal the addition of a syntax and working backwards to find any and all use cases which could justify it.

duan · January 24, 2021, 10:31pm

To be clear, I'm not proposing all of those things together. They are listed to demonstrate what first-class binary data could unlock. The core proposal is in the summary: the literal syntax, plus stdlib types. As I tried to say in the post, it's not a fully fleshed-out pitch. In such a pitch, these would be listed in the rationale section.

To me, this appears to be the minimum of what first-class binary data support requires. Perhaps there are ways to unbundle them such that they'd be introduced gradually (even optionally). For now, it's just a sketch.

I've dealt with byte signedness when working with imported C APIs many times over the years; it's mildly frustrating whenever I need to cast an [Int8] to a [UInt8] or vice versa. Another factor here is that I'm aware of the suboptimal situation with Int/float literals in Swift today, which led me to think about what an improved world on that front would be (what should back an arbitrary-precision int literal, for example). But, to be honest, the real catalyst for me is the wish for include_bytes. It just happens that, in order for include_bytes to have a place in the language, I need an answer as to: include as what? What represents the data for an image natively in Swift?
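For readers who haven't hit this, the signedness cast in question looks something like the following in current Swift. The sample values are made up for illustration; they stand in for bytes handed over by a C API on a platform where char is signed.

```swift
// Bytes as they might arrive from a C API that traffics in signed char.
let signed: [Int8] = [67, 97, 102, -61, -87]

// Reinterpret each element's bit pattern as unsigned; the bits themselves do not change.
let unsigned = signed.map { UInt8(bitPattern: $0) }

// And back again, for handing the bytes to another C function.
let roundTripped = unsigned.map { Int8(bitPattern: $0) }
```

Each direction allocates a new array, which is part of why the conversion feels heavier than it should.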

All of these streams of thoughts, over years, precipitated to the conclusion that a native bytes type would help.

xwu · January 24, 2021, 10:38pm

I've had to deal with this as well. (Annoying--I agree!) My understanding is that this is fundamentally a problem with inconsistencies with the underlying platform and that Swift's CChar is our best-effort attempt at abstracting this away. I'm not sure I see how creating yet another type will do any better than CChar in this respect.

I do think that there are some approaches that can help smooth this over. For instance, I know that there has been some exploration of improving the ergonomics of CGFloat-to-Double conversions, which echoes this problem, and it would be interesting to think about this holistically when some of those explorations mature.

We already have arbitrary precision integer literals; the transport type is Builtin.IntLiteral and it's backed by APInt. These bugs remain open because we don't have arbitrary precision float literals yet, but the design is already sketched out in one of the bugs linked to the one that you cite. There's no impediment to making that a reality without the addition of new types: the limiting factor is elbow grease.

Why not [CChar]? I think that's what we currently recommend--see, for instance, the examples in the documentation of String.init(cString:):

let validUTF8: [CChar] = [67, 97, 102, -61, -87, 0]
validUTF8.withUnsafeBufferPointer { ptr in
    let s = String(cString: ptr.baseAddress!)
    print(s)
}

I guess what I'm saying, big picture, is that I totally sympathize with a lot of these pain points, but they really are disparate issues, many of which have solutions at least sketched out that are natural outgrowths of work that's already been done, and I'm not sure I see how byte literals move them forward.

duan · January 24, 2021, 11:09pm

I don't claim to be an expert on C interop/ClangImporter. Platform inconsistency matches my impression of the source of the problem as well.

In my mind, Byte should have been the fulfillment of CChar. Carrying around signedness means a user must have a deeper understanding of integer representation in Swift than necessary: is my data from this C library going to be okay? Should I be worried that it's [Int8] and not [UInt8]? In reality, I predict folks who work with C APIs can most likely answer that type of question. But hey, law of large numbers.

Good to know. In my defense, my thoughts about the problem originated before that ticket.

Thanks. Me and [CChar] have been well acquainted. Without such experience I don't see how anyone could begin to generate a proposal like this. I suppose this part is subjective: it always rubs me the wrong way that we have to choose between [UInt8] and Foundation. I myself used NSData for a long time before I even thought about what it actually was, back in the day. I might've been subconsciously missing the comfort of an opaque, mysterious bit bag like that. Finally, why CChar? It suggests that Swift's support for binary data is oriented around interfacing with C. That also feels wrong. Again, all subjective!

SDGGiesbrecht · January 24, 2021, 11:33pm

Did you read the part where the core team said that adding single quoted literals, fixing the ExpressibleBy protocols, and enabling characters as numbers/bytes would need to be three separate proposals? Your pitch seems to lump all three together without addressing any of the issues that caused its predecessor’s rejection.

(On the other hand, later comments seem to be more about having a Byte type and less about literals. If literals aren’t the point, then just ignore what I said.)

xwu · January 25, 2021, 12:11am

I think here's where the good stuff is!

It's worth fleshing out a write-up about your experiences with [CChar] and its shortcomings (or, at least, I'd be interested in reading about it).

What's been most cumbersome (besides signedness mismatches)? What have been the most error-prone workarounds you've had to use?

How much of it can be solved by adding new methods to an extension Array where Element == CChar? Or is there something there that really requires a newtype-like solution?

Chris_Lattner3 · January 25, 2021, 2:21am

I'd still personally love to see a narrow and well scoped proposal that allows the following to work: var x = 'a' as a Character.

Also var y : Int8 = 'a' seems pretty obvious as well.

-Chris
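For reference, the closest spellings available in current Swift use double quotes plus a type annotation or an explicit initializer. This is a sketch of the status quo, not of the proposed single-quote syntax.

```swift
// A Character from a double-quoted literal, chosen by the type annotation.
let x: Character = "a"

// An ASCII code unit; UInt8(ascii:) traps at run time if the scalar is not ASCII.
let y = Int8(UInt8(ascii: "a"))
```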

Jean-Daniel · January 25, 2021, 8:32am

It would be obvious in an ASCII world. In a Unicode world, it's not so simple.

Of course, one simple answer is "limit this feature to ASCII chars and add a section in Future Directions", but if I recall correctly, it caused much ink to flow in the past discussion, so I would not call this pretty obvious ;-)

michaelgwelch · January 30, 2021, 1:56pm

Wait. What’s the difference between UInt8 and this proposed Byte? I obviously am missing some details about the internals?

jrose · February 1, 2021, 6:31am

I’m definitely in favor of adding more support around binary data that contains embedded text. Beyond that, though, the biggest improvement for me for working with binary data would be converging on [Prototype] Protocol-powered generic trimming, searching, splitting, – those are the utilities I add to Data whenever I work with byte formats. And I do use Data—it’s the type whose API is designed for aggregate byte work, even if it’s not quite the API I’d design. I’m not sure it’s worth adding another Bytes type to the Swift project.

How about Byte? C has the problem that char is used as the basis for uint8_t, the element type for string literals, and as the basis for raw byte manipulation through pointer casts, but Swift doesn’t have that problem. I don’t really know what you’d do with an individual Byte, and when you want to manipulate memory opaquely you use UnsafeRawPointer.

However, I have frequently wanted to use a literal string as a search term when scanning a Data. Data("foo".utf8) (or Array("foo".utf8)) isn’t terribly complicated, but it doesn’t handle the case you mention of “strings” containing non-UTF-8 bytes. So an ExpressibleByByteStringLiteral could be useful—and in fact, since every valid Unicode string is a valid byte string, one approach would be to not invent any new syntax, but to have ExpressibleByByteStringLiteral refine ExpressibleByStringLiteral, with the sole addition of the \xFF escape being valid. I would then argue that the one type in the stdlib to natively implement it would be UnsafeRawBufferPointer, with any other implementations (such as Data’s) built on top of that. It’s a little different from the other literal protocols, but I think it makes sense in practice, and it means #include() or whatever becomes a string literal, either a normal one or a byte string depending on the contents of the file.
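A small sketch of the search case with today's Foundation API. The haystack contents are invented for illustration; the point is that a text needle can come from a string, while a needle containing a non-UTF-8 byte has to be spelled as an integer array.

```swift
import Foundation

// Binary data that contains "foo" right after a byte that is not valid UTF-8.
let haystack = Data([0x00, 0xFF, 0x66, 0x6F, 0x6F, 0x00])

// A text needle can go through the string's UTF-8 view.
let textNeedle = Data("foo".utf8)
let textRange = haystack.range(of: textNeedle)

// A needle containing 0xFF cannot be written as a string literal today.
let rawNeedle = Data([0xFF, 0x66])
let rawRange = haystack.range(of: rawNeedle)
```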

This doesn’t solve ASCII byte literals. As much as I want to be Unicode-correct, I’m inclined to say Chris’s single-quote syntax is a way forward, as an ExpressibleByByteLiteral that only UInt8 and maybe Int8 conform to by default. I wouldn’t want to use single quotes for byte strings because it feels too subtle in the end (too easy to use the wrong one and get a weird error, or miss checking that you thought you’d get), but using it for individual bytes is well-precedented by C. (I don’t remember why the previous character literals proposal got stuck, but calling them “byte literals” helps some, at least for me.)

And if we want to save the single quote for other uses, we don’t actually need the close-quote. 'a, '\n…though '\' does look a little weird. :-)

benrimmington · February 1, 2021, 12:01pm

If the new protocol is only for UnsafeRawBufferPointer, then could it have a leading underscore, and only refine the existing _ExpressibleByBuiltinStringLiteral protocol?

The new protocol might restrict literals to ASCII, to avoid confusion when working with Latin-1, etc.

I think OSLog has some kind of compile-time interpolation; would that also be possible here?

import Foundation

// FIXME: @available(macOS 9999, iOS 9999, tvOS 9999, watchOS 9999, *)
extension Data: ExpressibleByStringInterpolation {

  public struct StringInterpolation: StringInterpolationProtocol {

    // FIXME: Use `UnsafeRawBufferPointer` instead.
    public typealias StringLiteralType = StaticString

    public private(set) var data: Data

    public init(literalCapacity: Int, interpolationCount: Int) {
      data = Data(capacity: literalCapacity + interpolationCount)
    }

    public mutating func appendLiteral(_ stringLiteral: StringLiteralType) {
      precondition(stringLiteral.isASCII)
      data.append(
        stringLiteral.utf8Start,
        count: stringLiteral.utf8CodeUnitCount
      )
    }

    public mutating func appendInterpolation(_ byte: UInt8) {
      data.append(byte)
    }

    public mutating func appendInterpolation(ascii: Unicode.Scalar) {
      data.append(UInt8(ascii: ascii))
    }

    public mutating func appendInterpolation(latin1: Unicode.Scalar) {
      data.append(UInt8(latin1.value))
    }
  }

  // FIXME: Use `UnsafeRawBufferPointer` instead.
  public typealias StringLiteralType = StaticString

  public init(stringLiteral: StringLiteralType) {
    precondition(stringLiteral.isASCII)
    self.init(
      bytesNoCopy: UnsafeMutableRawPointer(mutating: stringLiteral.utf8Start),
      count: stringLiteral.utf8CodeUnitCount,
      deallocator: .none
    )
  }

  public init(stringInterpolation: StringInterpolation) {
    self = stringInterpolation.data
  }
}

let data: Data = "\(0x89)PNG\r\n\u{1A}\n"
data.elementsEqual([137, 80, 78, 71, 13, 10, 26, 10]) //-> true

jrose · February 1, 2021, 5:22pm

Ah, it’s not just for URBP, because Data needs to conform too to make those \x escapes valid in your example. I did forget about interpolation, though, and diamond protocol hierarchies are…added complexity, at the very least. Thanks for bringing that up.

benrimmington · February 1, 2021, 6:03pm

I was hoping that by changing the StringLiteralTypes in my example (the FIXMEs), it would be enough to get \x support.

Although the requirements wouldn't change to the new protocol, they'd still be:

associatedtype StringLiteralType: _ExpressibleByBuiltinStringLiteral

lorentey · February 1, 2021, 10:56pm

Disorganized random thoughts:

This is not an easy problem, but it's worth solving. Swift (aspirationally) defines itself as a "high-performance systems programming language". Generating/processing binary data is a very systemsy task; therefore, Swift ought to provide convenient and safe ways to efficiently deal with binary data.

I get worried by any design that implicitly assigns a numerical value to characters in a source file. h is not a byte, it's a letter from the latin alphabet. 'hello' isn't a bytes literal: it's an ASCII bytes literal. Names matter; and very visibly emphasizing the name of the encoding in the language docs would somewhat make up for its (very unfortunate) omission from the proposed 'hello' syntax.

(ASCII is an important enough encoding that it probably deserves some special treatment -- but I think we need to be careful not to mislead people into thinking that A is universally the same thing as 1000001 (or 01000001). For instance, byte data is often encoded in base64, where A means 000000.)
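The two readings of the letter A can be checked with existing API. This is a minimal illustration using Foundation's base64 decoder; the input "AAAA" is chosen only because it decodes to a whole number of bytes.

```swift
import Foundation

// Read as ASCII, the letter A is the byte 0b01000001 (decimal 65).
let asciiA = UInt8(ascii: "A")

// Read as base64, each A stands for the 6-bit value 000000,
// so four of them decode to three zero bytes.
let decoded = Data(base64Encoded: "AAAA")
```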

[UInt8], [CChar] etc aren't great choices as the default type for an ASCII bytes literal, because arrays bind their storage to their element type. Ideally, the default type should not assume that its storage is bound to any particular type -- i.e., under its hood, it should deal with raw pointers, not typed pointers.

One interesting capability of a standard (relatively) safe container of untyped bytes would be the ability to take ownership of the storage behind other contiguous collections (String, Array, Data) without copying them. This could be a nice escape from the tyranny of closure-based APIs such as withContiguous[Mutable]StorageIfAvailable.

In theory, we could just embrace Data and move it into the Standard Library; unfortunately it has some design aspects (such as self-slicing with integer indices) that make this a difficult pill to swallow. Given how widespread the use of Data is, I don't expect there would be much room for the Swift Evolution process to make radical changes to it -- the source compatibility implications (never mind potential ABI issues) would be staggering. Introducing a new type could make for a smoother migration path, but it would still feel like a return to the days of Swift 3.

API-wise, I imagine the ideal safe byte buffer type would provide ways to easily reinterpret its contents as particular (trivial) types. (Through typed views like buffer[as: UInt16.self, endian: .little], through something like UMRBP's load(fromByteOffset:as:)/storeBytes(of:toByteOffset:as:) methods or maybe something like Swift NIO's ByteBuffer read/write methods.)
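As a point of reference, here is what the existing raw-buffer load/storeBytes route looks like. It is a sketch with manual allocation, so it shows the unsafe baseline that a safe buffer type would wrap; the function name and the sample value are illustrative.

```swift
// Stores a 32-bit value in little-endian byte order, then reads parts of it back.
func readBackLittleEndianWord() -> (firstByte: UInt8, lowHalf: UInt16) {
    // Allocate raw, untyped storage with alignment suitable for the loads below.
    let buffer = UnsafeMutableRawBufferPointer.allocate(
        byteCount: 4,
        alignment: MemoryLayout<UInt32>.alignment
    )
    defer { buffer.deallocate() }

    // Memory now holds 78 56 34 12 regardless of the host's endianness.
    buffer.storeBytes(of: UInt32(0x1234_5678).littleEndian, toByteOffset: 0, as: UInt32.self)

    // Byte-wise access goes through the buffer's UInt8 subscript.
    let first = buffer[0]

    // Reinterpret the first two bytes as a little-endian UInt16.
    let low = UInt16(littleEndian: buffer.load(fromByteOffset: 0, as: UInt16.self))
    return (first, low)
}

let readBack = readBackLittleEndianWord()
```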

lukasa · February 2, 2021, 9:31am

I've been thinking about this for a while and I'm increasingly pessimistic on the "one size fits all" "container of bytes" type. We could definitely add it, but too many of the pre-existing solutions have frozen representations that make them impossible to back with this type without breaking ABI.

I'm increasingly convinced that instead of trying to go back in time and centralise on a single basic "bucket o' bytes" type, we should make it much easier for frameworks to accept whatever bucket the user happens to have.

To use an example I'll consider swift-protobuf. Here's an example from the README:

// Create a BookInfo object and populate it:
var info = BookInfo()
info.id = 1734
info.title = "Really Interesting Book"
info.author = "Jane Smith"

// Serialize to binary protobuf format:
let binaryData: Data = try info.serializedData()

// Deserialize a received Data object from `binaryData`
let decodedInfo = try BookInfo(serializedData: binaryData)

In this instance, swift-protobuf has serialisation and deserialisation defined against Data: serializedData always returns a Data, and .init(serializedData:) always takes a Data. This is a fine enough default, but it means that if you happen to either have something that's not a Data (such as a ByteBuffer, [UInt8], UnsafeRawBufferPointer, or some custom type) or need something that's not a Data, you will have to incur an extra heap allocation and an extra copy to move between the two representations.

(Author's note: yes, I am aware of Data.init(bytesNoCopy:). This makes life moderately easier on ingestion if you carefully hold your types just right, but there is no escape on the serialise side of things.)

A better world would be one where we could define a common baseline interface that frameworks like protobuf can use, and then conform our existing bucket types to it. That would allow users to work with the data types they have and need, rather than be forced to transform to whichever ones the framework authors decided to privilege.

For deserialisation this almost exists already. Foundation has ContiguousBytes, which is borderline the correct answer to this problem. It defaults to the basic operation of "give me a pointer to your initialised storage", and can be used to bootstrap most parsing operations.
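A minimal sketch of what accepting "whatever bucket the user has" looks like with ContiguousBytes. The byteSum function is a made-up stand-in for a real parser; only the protocol and its withUnsafeBytes requirement come from Foundation.

```swift
import Foundation

// A stand-in for a parser: it only needs read access to initialised bytes.
func byteSum<Bytes: ContiguousBytes>(_ bytes: Bytes) -> Int {
    bytes.withUnsafeBytes { raw in
        raw.reduce(0) { $0 + Int($1) }
    }
}

// The same function accepts different containers without converting between them.
let fromArray = byteSum([1, 2, 3] as [UInt8])
let fromData = byteSum(Data([1, 2, 3]))
```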

For serialisation things are a lot harder, mostly because many serialisation formats don't know ahead of time how much space they need. This forces us to define a data type that can be reallocated. No such protocol exists today, though it could probably be defined with minimal effort.
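To give a feel for the shape such a protocol might take, here is a deliberately tiny sketch. ByteSink, appendBytes and writeUInt32LE are hypothetical names invented for this illustration; nothing like them exists in the standard library or Foundation, and a real design would need to consider capacity reservation and error handling.

```swift
import Foundation

// Hypothetical protocol: a growable destination for serialised bytes.
protocol ByteSink {
    mutating func appendBytes(_ bytes: UnsafeRawBufferPointer)
}

// Existing containers opt in by forwarding to their own append.
extension Array: ByteSink where Element == UInt8 {
    mutating func appendBytes(_ bytes: UnsafeRawBufferPointer) {
        append(contentsOf: bytes)
    }
}

extension Data: ByteSink {
    mutating func appendBytes(_ bytes: UnsafeRawBufferPointer) {
        append(contentsOf: bytes)
    }
}

// A serialiser written once against the protocol, not against a blessed type.
func writeUInt32LE<Sink: ByteSink>(_ value: UInt32, to sink: inout Sink) {
    withUnsafeBytes(of: value.littleEndian) { sink.appendBytes($0) }
}

var arraySink: [UInt8] = []
writeUInt32LE(1, to: &arraySink)

var dataSink = Data()
writeUInt32LE(1, to: &dataSink)
```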

If we had a nice native deserialisable protocol, these methods could be implemented on top of that protocol and all conforming types would get the implementation for free. Seems like a win!

All of this is somewhat orthogonal to the idea of a bytes literal. A bytes literal, to my mind, should probably just vend a static buffer from the binary. This would match nicely with the other proposals for include_bytes and friends. This, again, allows us to wrap this static data in whatever data type we want, instead of having to bless a single type.
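StaticString already behaves roughly this way for UTF-8 text, which may be a useful reference point. The sketch below uses only existing API; the copy into an Array is there just to have something to inspect, and the name magic is illustrative.

```swift
// A string literal typed as StaticString stays in the binary's constant data.
let magic: StaticString = "PNG"

// The closure sees the literal's UTF-8 bytes; wrapping them in an Array copies them out.
let copied: [UInt8] = magic.withUTF8Buffer { Array($0) }
```

A bytes literal that allowed arbitrary byte values could hand out the same kind of buffer, leaving the choice of container to the caller.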