Idea: Bytes Literal

Multiline string literals currently require a line break immediately after the opening quotation marks.

We could allow certain keywords in this position to introduce different kinds of data literals.

  1. RFC 4648 data literals:

    """base16
    E80C8931DC0E2CA7626164848DC47A78
    4AA196AEAED8F5C1CF99D3EA7D3DE700
    17F88D27814F5ED19C36
    """
    
    """base64
    wmVy+kecoLYY8GFPqkmMStybVfy00HKZ
    e6uoE17kpx2m1uy1lofsPFHKgYnRT+6c
    9458h/d3DU9rT+O5P9s=
    """
    
  2. ASCII-based data literals:

    """latin1
    ASCII or Latin-1 data literal with an \
    escaped line break.
    
    Escaped characters:
    * \0 \\ \t \n \r \" \'
    * \u{0}...\u{FF}
    
    Escaped bytes:
    * \{0b0}...\{0b1111_1111}
    * \{0o0}...\{0o377}
    * \{0x0}...\{0xFF}
    * \{0}...\{255}
    """
    
12 Likes

I like the idea of specifying the actual quotation mechanism. Maybe it could be explored separately as something along the lines of Scala quasiquotes or Rust macro expansion?

Imagine something like a "literal wrapper" that could allow a certain subset of functions (probably only total, i.e. pure and non-recursive) to process an arbitrary literal at compile time?

// evaluated at compile time
@literalWrapper func base64(_ literal: StaticString) -> [UInt8] {
  // decode a sequence of bytes from the base64 literal here, e.g.:
  guard let decoded = Data(base64Encoded: "\(literal)") else {
    fatalError("invalid base64 literal") // would surface as a compile-time error
  }
  return [UInt8](decoded)
}

// the resulting compiled binary would contain decoded data,
// this would avoid decoding base64 at run time
let base64EncodedData: [UInt8] = @base64 """
wmVy+kecoLYY8GFPqkmMStybVfy00HKZ
e6uoE17kpx2m1uy1lofsPFHKgYnRT+6c
9458h/d3DU9rT+O5P9s=
"""

This could be expanded to work on an arbitrary type and throwing literal wrappers that can fail during decoding (base64 literal wrapper can be made throwing accordingly). For example, I find it annoying that URL.init(string:) is optional even for static strings that are known to be valid URLs at compile time. What if a literal wrapper could parse URL literals and guarantee they parse to a valid URL or throw a compiler error otherwise?

struct InvalidURLError: Error {}

@literalWrapper func url(_ literal: StaticString) throws -> URL {
  guard let result = URL(string: "\(literal)") else { throw InvalidURLError() }
  return result
}

// guaranteed to be a non-optional URL,
// otherwise compilation will fail with InvalidURLError
let url = @url "https://httpbin.org/uuid"

I'm aware of the previous @compilerEvaluable pitches, but I wonder if we could approach it from a slightly different direction here that would cover byte literals and other literals too?

4 Likes

To elaborate on this, in the literal wrapper approach we'd probably need something like this protocol:

protocol BytesRepresentable {
  var bytes: [UInt8] { get }
  init(bytes: [UInt8])
}

This protocol would specify how exactly to embed a literal processed at compile-time in the final binary. Thus a function marked as @literalWrapper would be required to return something that conforms to BytesRepresentable. In this case

let url = @url "https://httpbin.org/uuid"

desugars into

let url = URL(
  // the resulting value of the `bytes` property below needs
  // to be replaced with a corresponding [UInt8] literal at
  // compile time
  bytes: <result of the wrapper call at compile time>.bytes
)

If we carefully lift the purity restriction for literal wrapper functions, something like include_bytes from the original post could look like this:

@literalWrapper
func includeBytes(_ filepath: StaticString) throws -> [UInt8] {
  try [UInt8](Data(contentsOf: URL(fileURLWithPath: "\(filepath)")))
}

let bytesFromFile = @includeBytes "file.blob"
4 Likes

How would such a unifying protocol work without a common currency type? At the very least you'd need some sort of common byte type, no? I guess you could abstract that as well, but that doesn't seem like it would be very high-performance.

In any case, I agree we need a unifying protocol, but I think we also need a common type in the standard library. We may not be able to meet every requirement from across the ecosystem, but a common type should allow a good starting place for common protocols and standard library features.

UInt8 is and remains the common byte type.

We have a common currency type: UnsafeMutableRawBufferPointer. As a sketch, we could imagine the protocol being this:

protocol AppendableRawBuffer: ContiguousBytes {
    mutating func withUnsafeUninitializedTrailingBytes<ReturnType>(
        minimumCapacity: Int,
        _ block: (UnsafeMutableRawBufferPointer, inout Int) throws -> ReturnType
    ) rethrows -> ReturnType

    var count: Int { get }
}

This is probably the minimum viable feature set required to serialise data into a buffer. It's a bit of a pain to perform some operations (e.g. back-to-front serialisation), but a basic forward-moving serialisation can be cheaply performed using this abstraction. It has all the necessary moving pieces. Indeed, this is the fundamental operation that NIO's ByteBuffer provides that is not currently available in Data or [UInt8].
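To make the sketch concrete, here is a self-contained version of such a protocol with [UInt8] conformed to it so a forward-moving serialisation can be demonstrated. The zero-fill stands in for genuinely uninitialized trailing capacity (which Array cannot expose safely), and everything beyond the sketch above is invented for illustration.

```swift
import Foundation

// Sketch of the protocol, plus a [UInt8] conformance for demonstration.
protocol AppendableRawBuffer: ContiguousBytes {
    mutating func withUnsafeUninitializedTrailingBytes<ReturnType>(
        minimumCapacity: Int,
        _ block: (UnsafeMutableRawBufferPointer, inout Int) throws -> ReturnType
    ) rethrows -> ReturnType
    var count: Int { get }
}

extension Array: AppendableRawBuffer where Element == UInt8 {
    mutating func withUnsafeUninitializedTrailingBytes<ReturnType>(
        minimumCapacity: Int,
        _ block: (UnsafeMutableRawBufferPointer, inout Int) throws -> ReturnType
    ) rethrows -> ReturnType {
        let oldCount = count
        // Zero-fill stands in for real uninitialized trailing capacity.
        append(contentsOf: repeatElement(0, count: minimumCapacity))
        var written = 0
        // Trim whatever the caller did not actually write.
        defer { removeLast(minimumCapacity - written) }
        return try withUnsafeMutableBytes { raw in
            try block(UnsafeMutableRawBufferPointer(rebasing: raw[oldCount...]), &written)
        }
    }
}

// Forward-moving serialisation: write a big-endian UInt32 into the buffer.
var buffer: [UInt8] = []
buffer.withUnsafeUninitializedTrailingBytes(minimumCapacity: 4) { bytes, written in
    bytes.storeBytes(of: UInt32(0xDEADBEEF).bigEndian, as: UInt32.self)
    written = 4
}
```

The `inout Int` parameter is what lets the buffer reclaim unused capacity, which is the piece Data and [UInt8] don't expose today.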

I think we need to know what this type is for. Any new currency type cannot be transformed into [UInt8] without an alloc-and-copy (as [UInt8] is both a) frozen and b) always owns its storage), and could only be transformed into a Data (in the best case, it's possible we'd need the copy too) with a heap-allocation for a closure context to manage reference counts of backing storage (as Data is frozen, so we can only handle this with custom deallocator functions).

So I think we do need to ask what we think the common currency type buys us. What problem is it trying to solve?
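As a concrete illustration of the Data side of this: wrapping existing storage without a copy is possible today, but only via Data.Deallocator, whose custom closure is exactly the heap-allocated context described above.

```swift
import Foundation

// Manually managed storage, bridged into Data without copying.
let byteCount = 4
let storage = UnsafeMutableRawBufferPointer.allocate(byteCount: byteCount, alignment: 1)
storage.copyBytes(from: [0xDE, 0xAD, 0xBE, 0xEF])

// The .custom deallocator closure manages the storage's lifetime on
// Data's behalf -- the closure context the post refers to.
let data = Data(
    bytesNoCopy: storage.baseAddress!,
    count: byteCount,
    deallocator: .custom { pointer, _ in pointer.deallocate() }
)
```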

1 Like

Bytes included in the file cannot be efficiently represented in [UInt8] because Array is frozen, and supports only the empty representation or a heap-allocated representation. The baseline data type would have to be UnsafeRawBufferPointer.

3 Likes

If we only support a limited set of encodings, they can be represented as string_literal in SIL.

%1 = string_literal base16 "17F88D27814F5ED19C36" // $Builtin.RawPointer

%2 = string_literal base64 "9458h/d3DU9rT+O5P9s=" // $Builtin.RawPointer

%3 = string_literal latin1 "\t \n \r \" ' \0...ÿ" // $Builtin.RawPointer

The ascii and utf8 encodings would use String as the default type.

The base16, base64, latin1, and utf16 encodings would use UnsafeRawBufferPointer as the default type.

3 Likes

URLs are complicated by the fact that there have been multiple URL standards, none of which were strictly adhered to, resulting in a recent new URL standardisation effort which is in fact a living document because it's such a difficult thing to nail down.

That really needs to be emphasised - for instance, recently there was a change which stopped stripping leading empty components from file URL paths. Previously, a URL like file:////////foo with a bunch of leading empties would get normalised down to file:///foo. This was a special behaviour that only applied to file URLs, and probably has its roots in the fact that most POSIX systems ignore leading empty components (it's "implementation defined", but IIUC essentially every system ignores it). Anyway, it turns out that the majority of browsers weren't actually following that part of the spec (maybe because the OS was doing it for them), so in the name of compatibility, the spec was simplified so it no longer stripped those leading empties. Sounds simple - except that it means that strings file:///foo and file:///////foo now produce different URLs. That ended up breaking software which was caching data by URL and expected those empty components to get collapsed away.

Again, living document. There have also been bugs fixed recently relating to URLs not being idempotent (a URL being re-parsed and giving a different result). I'm a bit apprehensive of what would happen if we generated a URL record based on the latest spec at compile-time, and that differed from the spec which other software expected at run-time.

In the words of the spec, it is being developed because "URL parsing needs to become as solid as HTML parsing". It's great that there's renewed effort, but that's still an aspiration as of now.

3 Likes

I'm not sure I understand the rationale behind the proposed differences here between latin1 and ascii, and between utf8 and utf16. A String value represents a Unicode string independent of its underlying encoding, and I would argue strongly that we shouldn't muddy the waters here for strings and string literals.

To my mind, the base16 and base64 examples are different in kind because those literal values represent the encoded versions of binary data (semantically). It would certainly be interesting to consider if a literal syntax (perhaps using the currently unused single quotation marks) could be adopted for that purpose. That is to say, for various literal representations of bytes (of binary data), in contradistinction to representations of strings (of characters, however encoded). I think it would be enormously beneficial conceptually to separate the idea entirely from strings and string encodings.

1 Like
  • utf8 is the default implicit encoding, used by all string literals in current Swift.
  • ascii would be a compile-time guarantee (in case that's useful) to limit the allowable characters.
  • latin1 would be an encoding where escaped bytes (e.g. \xFF or \{0xFF}) are allowed.
  • utf16 would be for the WebAssembly use-case in post #2.

If we exclude ascii, then all explicit encodings are for data literals.

I wonder if within the literal wrappers idea we could have a couple of them be "magical", i.e. lowered directly to SIL instructions, but still visible in the API as normal functions. Then when generating SIL, @utf16 "abcdef" would be lowered to string_literal utf16 "abcdef", while still allowing the rest of the user-defined literal wrappers to be evaluated properly at compile time?

1 Like

Seems like UTF-8 would be just as useful for bytes/data literals.

As for ASCII, for the reasons stated above, I'd be wary of muddying the water about the encoding of string literals--we're fundamentally discussing different semantics here ("How do I ergonomically spell an arbitrary sequence of bytes?" versus "How do I choose an encoding for my literal string?"). There is no default encoding for string literals. The String type might now default to UTF-8 for its internal representation, but we didn't spell literals any differently back when it defaulted to UTF-16.


But I think I see more clearly now the use case for a bytes/data literal and how it would dovetail with the standard library as it exists today. I do think we need to be careful to define the problem carefully (and narrowly), else we run the danger of projecting onto it all sorts of superficially related but ultimately disparate issues and goals.

I see @lukasa's point above that the default data type for such a literal would have to be UnsafeRawBufferPointer. It would make sense to me to go back to some of what @duan initially proposed here, building on that starting point: Suppose we adopt 'foo' as the spelling for a raw buffer literal; this literal could be constrained not to accept arbitrary Unicode but only ASCII and escape sequences, ideal when the underlying byte sequence is of import rather than the composed characters (since more than one Unicode sequence can represent the same character).

We would then offer as part of the standard library a common protocol ExpressibleByRawBufferLiteral, to which types like Data could conform; such types would have to implement init(rawBufferLiteral: Self.RawBufferLiteralType) where Self.RawBufferLiteralType: _ExpressibleByBuiltinRawBufferLiteral.

In the standard library, we would define typealias DefaultRawBufferLiteralType = UnsafeRawBufferPointer and conform that type to _ExpressibleByBuiltinRawBufferLiteral. Other first-party types might be considered for magical conformance to that protocol also.

We could add convenience methods as needed to UnsafeRawBufferPointer to make working with that type more ergonomic, and we could consider additional syntax specific to the raw buffer literal (such as \x42 escape syntax, specifying base64 encoding, etc.). In fact, I wonder if it would be reasonable (if we had base64 encoding) to allow arbitrary trailing = to pad the raw buffer to a specified size.
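A minimal sketch of that hierarchy follows. The underscored "builtin" protocol is left empty here because the real one would be compiler-backed, and the Data conformance is purely illustrative; none of this is real API.

```swift
import Foundation

// Placeholder for the compiler-backed builtin protocol.
protocol _ExpressibleByBuiltinRawBufferLiteral {}

protocol ExpressibleByRawBufferLiteral {
    associatedtype RawBufferLiteralType: _ExpressibleByBuiltinRawBufferLiteral
    init(rawBufferLiteral: RawBufferLiteralType)
}

typealias DefaultRawBufferLiteralType = UnsafeRawBufferPointer
extension UnsafeRawBufferPointer: _ExpressibleByBuiltinRawBufferLiteral {}

// A type like Data could adopt the literal protocol by copying the
// compiler-provided buffer into its own storage.
extension Data: ExpressibleByRawBufferLiteral {
    init(rawBufferLiteral buffer: UnsafeRawBufferPointer) {
        self = buffer.baseAddress.map { Data(bytes: $0, count: buffer.count) } ?? Data()
    }
}

// Without compiler support, the conformance can only be exercised manually:
let payload: [UInt8] = [0x68, 0x69]
let data = payload.withUnsafeBytes { Data(rawBufferLiteral: $0) }
```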

2 Likes

I'm a little lost in this discussion, but why must a base type that stores raw data be contiguous? Such a requirement would make the use of a type like dispatch_data needlessly complicated or inefficient.

I agree! However, the bucket-o-bytes type I have in mind is only intended to go the other way -- it would be able to work with whatever storage representation you have, but there is no expectation that other types would directly adopt it as a storage representation.

struct ByteBucket {
  let owner: AnyObject?
  let buffer: UnsafeRawBufferPointer
}

Note how this is just an UnsafeRawBufferPointer that maintains an owning reference to its storage.

We already have UnsafeRawBufferPointer as a universal byte bucket type. We can create a URBP over any contiguous storage, no matter how it's represented. The Standard Library is already using this type (and UnsafeBufferPointer) as a way to provide direct access to storage (of whatever form) without getting bogged down with details about how that storage is represented. This problem does not need to be solved with a protocol.

The major problem with URBP is that it is unsafe -- it neither owns its storage, nor does it perform bounds checking in production code -- and this makes its use questionable in all but the simplest situations. Introducing a safe(r) buffer pointer variant would let us keep the advantage of having a non-generic universal bucket o' bytes type while also allowing us to safely pass these buckets through thread boundaries etc.

Data clearly wants to be the universal byte bucket type, and it serves that role in Apple's SDKs. However, Data has some issues.

  • It is defined in the wrong module (note: this can be fixed)
  • It has evolved largely outside of the Swift Evolution Process
  • It has issues with representational complexity (e.g. Data.Deallocator, __DataStorage._offset, ...)
  • It has issues with its API (e.g. integer indices vs self-slicing, mutable count, ...)

Foundation also defines a piecewise contiguous byte bucket protocol called DataProtocol. It has not seen widespread adoption.

4 Likes

Using an unsafe type as the default type of a bytes literal (of whatever encoding) would be unfortunate -- once we do settle on standard safe byte bucket type, we'd immediately regret that choice. (We'd end up in a similar situation as Objective-C & C++, where regular string literals produce a C pointer.)

In the current Swift ecosystem, 'hello' wants to be of type Data.

2 Likes

Piecewise contiguous representations pop up very often, and they can and should be modeled as a sequence of contiguous chunks, with Unsafe[Raw]BufferPointer (or a safe alternative) as the chunk type. I.e., piecewise contiguous representations will be built on top of the contiguous primitive.

I think support for accessing these chunks ought to be built directly into the existing Sequence protocol hierarchy, by adding new primitive requirements -- such as these:

protocol IteratorProtocol {
  ...
  var isSegmented: Bool { mutating get }
  mutating func withNextUnsafeSegment<R>(
    maximumCount: Int?,
    _ body: (UnsafeBufferPointer<Element>) throws -> R
  ) rethrows -> R?
}

protocol Collection {
  ...
  func withUnsafeSegment<R>(
    startingAt start: Index,
    maximumCount: Int?,
    _ body: (UnsafeBufferPointer<Element>) throws -> R
  ) rethrows -> (end: Index, result: R)
}

Efficient algorithms for e.g. copying data across such data structures can be built on top of these. (These are generalizations of the existing, undocumented _copyContents requirement.)

Note that we're currently missing a Sequence/IteratorProtocol equivalent for untyped storage. I expect we'll want to add one, likely defined entirely in terms of chunked access. Work on this is probably best done once we have settled the representation of the contiguous chunks. (And once we've already solved the (easier) typed storage case.) I think this topic is largely independent of this discussion.
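As a stand-in illustration (with invented names, since the requirements above don't exist on the real protocols yet), here is a piecewise-contiguous byte store whose flattening copy is written purely against per-segment buffer pointers:

```swift
// A piecewise-contiguous store: each chunk is contiguous, the whole is not.
struct ChunkedBytes {
    var chunks: [[UInt8]]

    // The moral equivalent of withNextUnsafeSegment, driven to exhaustion:
    // each contiguous chunk is handed to the caller as a raw buffer pointer.
    func forEachSegment(_ body: (UnsafeRawBufferPointer) -> Void) {
        for chunk in chunks {
            chunk.withUnsafeBytes(body)
        }
    }

    // A copy algorithm built purely on top of the chunked access primitive.
    func flattened() -> [UInt8] {
        var result: [UInt8] = []
        result.reserveCapacity(chunks.reduce(0) { $0 + $1.count })
        forEachSegment { result.append(contentsOf: $0) }
        return result
    }
}

let store = ChunkedBytes(chunks: [[1, 2], [], [3, 4, 5]])
let copy = store.flattened()
```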

This looks pretty close to the regions API on Data, perhaps that might be an interesting exercise to delve into a bit more?

1 Like

How do you envision such a type being used to address my use-case of "I would like to serialise into/parse out of a bucket of bytes type X"? In particular, how does it address the question of resizing the buffer?

I think using single quotes for ASCII is a good idea. We could then have single-line and multiline literals.

The position after an opening ''' or """ could be reserved for line break options (LF by default; CRLF for certain data formats).

If we had a new protocol hierarchy (similar to String, Character, and Unicode.Scalar literals):

  1. _ExpressibleByBuiltinRawBufferLiteral as you've suggested.

    • UnsafeRawBufferPointer (or a safe wrapper?) by default.
    • UnsafeRawPointer with null-terminator for C/C++ interop.
  2. _ExpressibleByBuiltinIntegerLiteral for base256 integer literals.

    • All integer (and floating-point?) types in the standard library.
    • e.g. UInt32('ABCD') is UInt32(0x41_42_43_44).
    • e.g. CChar('\n') is CChar(0x0A) as expected.
    • e.g. CChar("\n") is nil (SR-747).
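The UInt32('ABCD') packing in item 2 can be sketched at run time today; packBase256 is an invented helper, and the proposed literal form would of course be checked and folded at compile time instead.

```swift
// Run-time stand-in for the proposed base256 literal: pack up to four
// ASCII characters into a big-endian UInt32.
func packBase256(_ ascii: String) -> UInt32? {
    let bytes = Array(ascii.utf8)
    guard bytes.count == 4, bytes.allSatisfy({ $0 < 0x80 }) else { return nil }
    return bytes.reduce(0) { ($0 << 8) | UInt32($1) }
}
```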

For simplicity, perhaps base64 can be omitted?

The \x escape could support base16 by adding braces and ignoring line breaks, etc.

'''
Bytes literal with an \
escaped line break.

Escaped ASCII characters:
* \0 \\ \t \n \r \" \'
* \u{0}...\u{7F}

Escaped base16 bytes:
\x{
  E80C8931DC0E2CA7626164848DC47A78
  4AA196AEAED8F5C1CF99D3EA7D3DE700
  17F88D27814F5ED19C36
}
'''

Mutations (including resizing) would execute in-place only when the owner is uniquely referenced, and of a particular "native" type. This matches how standard collections like String deal with mutations of bridged (or otherwise "foreign") instances. (And in fact the in-memory layout of such a byte bucket would just be a heavily simplified version of String.) This representation would also allow for read-only immortal bytes, such as ones generated at compile-time from bytes literals.

Because storage would be required to be contiguous, we wouldn't be able to wrap an entire dispatch_data or a ring buffer into these. However, we could still use these to unify the representation of their contiguous pieces. (dispatch_data would be something like a sequence of these buffers, a little like Data vs DataProtocol, but hopefully with a more practical design.)
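A minimal copy-on-write sketch of such a "native" byte bucket, using isKnownUniquelyReferenced the same way the standard collections do; all names here are invented.

```swift
// Reference-typed storage that the value-typed bucket owns.
final class BucketStorage {
    var bytes: [UInt8]
    init(_ bytes: [UInt8]) { self.bytes = bytes }
}

struct ByteBucket {
    private var storage: BucketStorage
    init(_ bytes: [UInt8] = []) { storage = BucketStorage(bytes) }

    var count: Int { storage.bytes.count }

    mutating func append(_ byte: UInt8) {
        // Mutate in place only when uniquely referenced; copy otherwise.
        if !isKnownUniquelyReferenced(&storage) {
            storage = BucketStorage(storage.bytes)
        }
        storage.bytes.append(byte)
    }

    func withUnsafeBytes<R>(_ body: (UnsafeRawBufferPointer) throws -> R) rethrows -> R {
        try storage.bytes.withUnsafeBytes(body)
    }
}

var a = ByteBucket([1, 2])
let b = a          // shares storage with a
a.append(3)        // triggers the copy; b is unaffected
```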

I really don't think we can introduce a bytes literal syntax (of whatever encoding and features) without having a safe standard library type that can serve as their default type. None of [UInt8], [CChar], String, Unsafe[Raw]BufferPointer or Data seem appropriate for this role to me. (Data definitely comes closest -- but only as long as we don't look too close.)

(Introducing special syntax for byte literals that can only initialize individual UInt8/Int8 values (as in let a: Int8 = 'a') seems like a waste of effort -- if that's all we want, we could just define a bunch of namespaced constants for the ASCII character set.)