Idea: Bytes Literal

I'm a little lost in this discussion, but why must a base type for storing raw data be contiguous? Such a requirement would make using a type like dispatch_data needlessly complicated or inefficient.

I agree! However, the bucket-o-bytes type I have in mind is only intended to go the other way -- it would be able to work with whatever storage representation you have, but there is no expectation that other types would directly adopt it as a storage representation.

struct ByteBucket {
  let owner: AnyObject?
  let buffer: UnsafeRawBufferPointer
}

Note how this is just an UnsafeRawBufferPointer that maintains an owning reference to its storage.
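
To make the ownership idea concrete, here is a minimal sketch of how such a bucket could be produced and read. The `HeapBytes` owner class, `makeBucket(copying:)` and `withBytes(_:)` are hypothetical names invented for this illustration; only the two-field `ByteBucket` layout comes from the post above.

```swift
// Hypothetical owner: allocates raw memory and frees it when released.
final class HeapBytes {
  let storage: UnsafeMutableRawBufferPointer

  init(copying bytes: [UInt8]) {
    storage = .allocate(byteCount: bytes.count, alignment: 1)
    storage.copyBytes(from: bytes)
  }

  deinit { storage.deallocate() }
}

struct ByteBucket {
  let owner: AnyObject?
  let buffer: UnsafeRawBufferPointer
}

extension ByteBucket {
  // Assumption: reads go through a closure so the owner is guaranteed
  // to stay alive for as long as the pointer is in use.
  func withBytes<R>(_ body: (UnsafeRawBufferPointer) throws -> R) rethrows -> R {
    try withExtendedLifetime(owner) { try body(buffer) }
  }
}

func makeBucket(copying bytes: [UInt8]) -> ByteBucket {
  let owner = HeapBytes(copying: bytes)
  return ByteBucket(owner: owner, buffer: UnsafeRawBufferPointer(owner.storage))
}

let bucket = makeBucket(copying: [0x68, 0x69])
let roundTripped = bucket.withBytes { Array($0) }
```

The bucket itself never needs to know what `owner` is; releasing the last copy of the struct releases the owner, which frees the memory.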

We already have UnsafeRawBufferPointer as a universal byte bucket type. We can create a URBP over any contiguous storage, no matter how it's represented. The Standard Library is already using this type (and UnsafeBufferPointer) as a way to provide direct access to storage (of whatever form) without getting bogged down with details about how that storage is represented. This problem does not need to be solved with a protocol.

The major problem with URBP is that it is unsafe -- it neither owns its storage, nor does it perform bounds checking in production code -- and this makes its use questionable in all but the simplest situations. Introducing a safe(r) buffer pointer variant would let us keep the advantage of having a non-generic universal bucket o' bytes type while also allowing us to safely pass these buckets through thread boundaries etc.

Data clearly wants to be the universal byte bucket type, and it serves that role in Apple's SDKs. However, Data has some issues.

  • It is defined in the wrong module (note: this can be fixed)
  • It has evolved largely outside of the Swift Evolution Process
  • It has issues with representational complexity (e.g. Data.Deallocator, __DataStorage._offset, ...)
  • It has issues with its API (e.g. integer indices vs self-slicing, mutable count, ...)

Foundation also defines a piecewise contiguous byte bucket protocol called DataProtocol. It has not seen widespread adoption.

Using an unsafe type as the default type of a bytes literal (of whatever encoding) would be unfortunate -- once we do settle on standard safe byte bucket type, we'd immediately regret that choice. (We'd end up in a similar situation as Objective-C & C++, where regular string literals produce a C pointer.)

In the current Swift ecosystem, 'hello' wants to be of type Data.

Piecewise contiguous representations pop up very often, and they can and should be modeled as a sequence of contiguous chunks, with Unsafe[Raw]BufferPointer (or a safe alternative) as the chunk type. I.e., piecewise contiguous representations will be built on top of the contiguous primitive.
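
As an illustration of "a sequence of contiguous chunks", here is a small sketch of a ring buffer that exposes its contents as at most two contiguous segments. `RingBuffer` and `forEachSegment(_:)` are hypothetical names made up for this example, not existing API.

```swift
struct RingBuffer {
  private var storage: [UInt8]
  private var head = 0            // index of the oldest byte
  private(set) var count = 0

  init(capacity: Int) {
    storage = Array(repeating: 0, count: capacity)
  }

  mutating func append(_ byte: UInt8) {
    precondition(count < storage.count, "ring buffer is full")
    storage[(head + count) % storage.count] = byte
    count += 1
  }

  mutating func removeFirst() -> UInt8 {
    precondition(count > 0, "ring buffer is empty")
    let byte = storage[head]
    head = (head + 1) % storage.count
    count -= 1
    return byte
  }

  // Visits the stored bytes in order as one or two contiguous chunks.
  func forEachSegment(_ body: (UnsafeBufferPointer<UInt8>) throws -> Void) rethrows {
    try storage.withUnsafeBufferPointer { whole in
      let firstLength = min(count, whole.count - head)
      if firstLength > 0 {
        try body(UnsafeBufferPointer(rebasing: whole[head ..< head + firstLength]))
      }
      if count > firstLength {
        // The remainder wrapped around to the start of the storage.
        try body(UnsafeBufferPointer(rebasing: whole[0 ..< count - firstLength]))
      }
    }
  }
}

var ring = RingBuffer(capacity: 4)
for byte in [1, 2, 3, 4] as [UInt8] { ring.append(byte) }
_ = ring.removeFirst()
_ = ring.removeFirst()
ring.append(5)                    // wraps around to index 0

var segments: [[UInt8]] = []
ring.forEachSegment { segments.append(Array($0)) }
// segments == [[3, 4], [5]]
```

The ring buffer is not contiguous as a whole, but every piece handed to the closure is, so the contiguous buffer pointer remains the only primitive a consumer has to understand.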

I think support for accessing these chunks ought to be built directly into the existing Sequence protocol hierarchy, by adding new primitive requirements -- such as these:

protocol IteratorProtocol {
  ...
  var isSegmented: Bool { mutating get }
  mutating func withNextUnsafeSegment<R>(
    maximumCount: Int?,
    _ body: (UnsafeBufferPointer<Element>) throws -> R
  ) rethrows -> R?
}

protocol Collection {
  ...
  func withUnsafeSegment<R>(
    startingAt start: Index,
    maximumCount: Int?,
    _ body: (UnsafeBufferPointer<Element>) throws -> R
  ) rethrows -> (end: Index, result: R)
}

Efficient algorithms for e.g. copying data across such data structures can be built on top of these. (These are generalizations of the existing, undocumented _copyContents requirement.)

Note that we're currently missing a Sequence/IteratorProtocol equivalent for untyped storage. I expect we'll want to add one, likely defined entirely in terms of chunked access. Work on this is probably best done once we have settled the representation of the contiguous chunks. (And once we've already solved the (easier) typed storage case.) I think this topic is largely independent of this discussion.

This looks pretty close to the regions API on Data; perhaps that would be an interesting exercise to delve into a bit more?
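
For comparison, this is roughly what walking the regions of a Foundation `DataProtocol` value looks like today. The helper name `copyContiguously(_:)` is made up for this sketch; `regions` and `withUnsafeBytes` are the existing Foundation requirements.

```swift
import Foundation

// Copies any DataProtocol value into one contiguous array,
// visiting its storage one contiguous region at a time.
func copyContiguously<D: DataProtocol>(_ source: D) -> [UInt8] {
  var result: [UInt8] = []
  result.reserveCapacity(source.count)
  for region in source.regions {
    // Each region is guaranteed to be ContiguousBytes.
    region.withUnsafeBytes { result.append(contentsOf: $0) }
  }
  return result
}

let fromData = copyContiguously(Data([1, 2, 3]))
let fromArray = copyContiguously([4, 5] as [UInt8])
```

For `Data` and `[UInt8]` there is only a single region, so the loop runs once; the shape only pays off for genuinely segmented types.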

How do you envision such a type being used to address my use-case of "I would like to serialise into/parse out of a bucket of bytes type X"? In particular, how does it address the question of resizing the buffer?

I think using single quotes for ASCII is a good idea. We could then have single-line and multiline literals.

The position after an opening ''' or """ could be reserved for line break options (LF by default; CRLF for certain data formats).

If we had a new protocol hierarchy (similar to String, Character, and Unicode.Scalar literals):

  1. _ExpressibleByBuiltinRawBufferLiteral as you've suggested.

    • UnsafeRawBufferPointer (or a safe wrapper?) by default.
    • UnsafeRawPointer with null-terminator for C/C++ interop.
  2. _ExpressibleByBuiltinIntegerLiteral for base256 integer literals.

    • All integer (and floating-point?) types in the standard library.
    • e.g. UInt32('ABCD') is UInt32(0x41_42_43_44).
    • e.g. CChar('\n') is CChar(0x0A) as expected.
    • e.g. CChar("\n") is nil -- SR-747.
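
Since single-quoted literals don't exist, the base256 interpretation can only be approximated with an ordinary function. The sketch below uses a hypothetical `base256(_:as:)` helper taking a regular string; it is meant to show the intended values, not how a compiler would implement the literal.

```swift
// Interprets an ASCII string as a big-endian base-256 integer.
// Returns nil for empty input, non-ASCII bytes, or too many bytes for T.
func base256<T: FixedWidthInteger>(_ text: String, as type: T.Type = T.self) -> T? {
  let units = text.utf8
  guard !units.isEmpty, units.count <= T.bitWidth / 8 else { return nil }
  var value: T = 0
  for unit in units {
    guard unit < 0x80 else { return nil }   // ASCII only
    value = (value << 8) | T(truncatingIfNeeded: unit)
  }
  return value
}

let abcd = base256("ABCD", as: UInt32.self)   // Optional(0x41_42_43_44)
let newline = base256("\n", as: CChar.self)   // Optional(0x0A)
let tooLong = base256("ABCDE", as: UInt32.self)   // nil
```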

For simplicity, perhaps base64 can be omitted?

The \x could support base16, by adding braces, and ignoring line breaks, etc.

'''
Bytes literal with an \
escaped line break.

Escaped ASCII characters:
* \0 \\ \t \n \r \" \'
* \u{0}...\u{7F}

Escaped base16 bytes:
\x{
  E80C8931DC0E2CA7626164848DC47A78
  4AA196AEAED8F5C1CF99D3EA7D3DE700
  17F88D27814F5ED19C36
}
'''
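
The `\x{...}` payload rule (pairs of base16 digits, with whitespace and line breaks ignored) can be sketched as a runtime function. `decodeBase16(_:)` is a hypothetical helper; a real literal would be decoded by the compiler instead.

```swift
// Decodes pairs of ASCII hex digits into bytes, skipping whitespace
// and line breaks. Returns nil on any other character or an odd digit count.
func decodeBase16(_ text: String) -> [UInt8]? {
  var bytes: [UInt8] = []
  var highNibble: UInt8? = nil
  for character in text {
    if character.isWhitespace { continue }
    guard character.isASCII, let digit = character.hexDigitValue else { return nil }
    if let high = highNibble {
      bytes.append((high << 4) | UInt8(digit))
      highNibble = nil
    } else {
      highNibble = UInt8(digit)
    }
  }
  return highNibble == nil ? bytes : nil
}

let decoded = decodeBase16("""
  DEAD
  be ef
  """)
// decoded == [0xDE, 0xAD, 0xBE, 0xEF]
```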

Mutations (including resizing) would execute in-place only when the owner is uniquely referenced, and of a particular "native" type. This matches how standard collections like String deal with mutations of bridged (or otherwise "foreign") instances. (And in fact the in-memory layout of such a byte bucket would just be a heavily simplified version of String.) This representation would also allow for read-only immortal bytes, such as ones generated at compile-time from bytes literals.
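
The uniqueness half of that rule is the usual copy-on-write pattern. Here is a minimal sketch with made-up `ByteStorage` and `CowBytes` types; it only checks uniqueness and leaves out the "native type" check and any foreign owners.

```swift
final class ByteStorage {
  var bytes: [UInt8]
  init(_ bytes: [UInt8]) { self.bytes = bytes }
}

struct CowBytes {
  private var storage: ByteStorage

  init(_ bytes: [UInt8]) { storage = ByteStorage(bytes) }

  var bytes: [UInt8] { storage.bytes }

  // Exposed only so the sketch can show when storage is replaced.
  var storageIdentity: ObjectIdentifier { ObjectIdentifier(storage) }

  mutating func append(_ byte: UInt8) {
    // Mutate in place only if nobody else holds the owner;
    // otherwise copy into fresh storage first.
    if !isKnownUniquelyReferenced(&storage) {
      storage = ByteStorage(storage.bytes)
    }
    storage.bytes.append(byte)
  }
}

let original = CowBytes([1, 2])
var copy = original
copy.append(3)          // shared owner: copies before mutating
// original.bytes == [1, 2], copy.bytes == [1, 2, 3]

let identityBefore = copy.storageIdentity
copy.append(4)          // unique owner: mutates in place
let mutatedInPlace = (copy.storageIdentity == identityBefore)
```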

Because storage would be required to be contiguous, we wouldn't be able to wrap an entire dispatch_data or a ring buffer into these. However, we could still use these to unify the representation of their contiguous pieces. (dispatch_data would be something like a sequence of these buffers, a little like Data vs DataProtocol, but hopefully with a more practical design.)

I really don't think we can introduce a bytes literal syntax (of whatever encoding and features) without having a safe standard library type that can serve as their default type. None of [UInt8], [CChar], String, Unsafe[Raw]BufferPointer or Data seem appropriate for this role to me. (Data definitely comes closest -- but only as long as we don't look too close.)

(Introducing special syntax for byte literals that can only initialize individual UInt8/Int8 values (as in let a: Int8 = 'a') seems like a waste of effort -- if that's all we want, we could just define a bunch of namespaced constants for the ASCII character set.)

The main use case I have for this is switches, but I guess I could match a byte string literal against a prefix instead of a single byte.

One interesting aspect of reusing existing string literal syntax is that it sidesteps this question quite nicely. It's still weird that there's no safe bytes type in the standard library, sure, but you won't get an unsafe or suboptimal type by accident.

Instead of adding new protocols (for byte strings and interpolation), could we simply add another initializer to the existing _ExpressibleByBuiltinStringLiteral protocol?

  • The new initializer would only be used when a string literal contains \x{...}.

  • A default implementation (if needed) would forward to the existing initializer.

  • String would replace invalid UTF-8 bytes with U+FFFD.

  • StaticString might allow invalid UTF-8 bytes, by updating preconditions, and adding APIs:

    • e.g. public var isUTF8: Bool (using a spare bit in _flags).
    • e.g. public var bytes: UnsafeRawBufferPointer (can be empty).
  • Byte string interpolation would use the existing protocols, with their StringLiteralType: _ExpressibleByBuiltinStringLiteral associated types.
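
The U+FFFD behaviour suggested for String matches what the existing repairing initializer already does at run time, which can be seen with a few lines. Comparing the UTF-8 views afterwards is just one way, chosen for this sketch, to detect that a repair happened.

```swift
let rawBytes: [UInt8] = [0x68, 0x69, 0xFF]    // "hi" followed by an invalid byte

// String(decoding:as:) never fails; invalid sequences become U+FFFD.
let repaired = String(decoding: rawBytes, as: UTF8.self)
// repaired == "hi\u{FFFD}"

// If the round trip changes the bytes, the input was not valid UTF-8.
let wasValid = repaired.utf8.elementsEqual(rawBytes)
// wasValid == false
```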


Off-topic

Could the previous Unicode scalar literals pitch be solved by having a shorter type name?

public typealias Rune = Unicode.Scalar
extension StringProtocol {
  public typealias RuneView = Self.UnicodeScalarView
  public var runes: RuneView { self.unicodeScalars }
}

let rune = Rune("\u{1F600}")
rune.properties.isEmoji  //-> true
rune.properties.name     //-> "GRINNING FACE"
rune.utf8.count          //-> 4

let runes = "ABC".runes
runes.allSatisfy(\.isASCII)  //-> true
runes.count                  //-> 3

I think this approach seems reasonable to me, I can see how it would be made to work. I'd be interested in work in this direction.

Relatedly, UInt8.init(ascii:) happens to be a shortcut for this. Under optimisation, it compiles down to the UInt8 value of a single ASCII character. Of course, this behaviour is non-obvious due to the absence of clear constexpr semantics in Swift, and it'd be nice to have something that would more clearly perform this operation.
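
For the switch use case mentioned earlier in the thread, this is what matching bytes with `UInt8(ascii:)` looks like in current Swift. The `classify(_:)` function is only an illustration.

```swift
func classify(_ byte: UInt8) -> String {
  switch byte {
  case UInt8(ascii: "0")...UInt8(ascii: "9"):
    return "digit"
  case UInt8(ascii: "a")...UInt8(ascii: "z"),
       UInt8(ascii: "A")...UInt8(ascii: "Z"):
    return "letter"
  case UInt8(ascii: "\n"), UInt8(ascii: "\r"):
    return "line break"
  default:
    return "other"
  }
}

let kinds = Array("a1\n!".utf8).map(classify)
// kinds == ["letter", "digit", "line break", "other"]
```

It works, but each case is noticeably noisier than a hypothetical `case 'a'...'z':` would be.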

I think there are kind of 2 discussions happening here - one about data literals (which presumably would look/work something like StaticString, support @compilerEvaluable when that's a thing, etc), and another about our general handling of owned data.

WRT the first discussion: it's good. We should do it.

WRT the second discussion:

It's interesting - I have quite a pressing need for something like this, so I'd be happy to help in any way I can. Similar issues came up in the shared string thread, and in discussions I've had with others about how to drive that feature forward. I also kind of touched on that topic when Array.init(unsafeUninitializedCapacity:initializingWith:) was proposed, and again when the string version was proposed -- so it's something that has irritated me a bit over the years.

IMO, I should be able to go from ManagedBuffer<Header, UInt8> -> Array<UInt8> -> String without copying. At least for read-only use cases.

A simple (pointer, owner) pair object would be really great as a starting point, to represent "some pointer kept alive by an ARCed object". I wonder if the owner object could be anything more than a dumb ARCed thing, and maybe conform to protocols which we could as? cast to enable more functionality, such as checking for uniqueness or setting the count of initialized elements.

Most use-cases I have would require the buffer to be typed, though. For shared strings, I think we'd probably want to use a buffer of UInt8s rather than a raw buffer (UInt8 of course being the UTF-8 code unit type). Unfortunately this would lead to another RawByteBucket/ByteBucket<T> split ☹️

We could definitely use better APIs for inserting contiguous data into contiguous collections! I had a need for something like this recently (for a similar reason; I'm simplifying a path string and serialising the result in a generic container, starting at the last path component and working towards the front), so I've been using the following family of functions. They've been incredibly useful.

/// Appends space for the given number of objects, but leaves the initialization of that space to the given closure.
///
/// - important: The closure must initialize **exactly** `uninitializedCapacity` elements.
///
mutating func unsafeAppend(
  uninitializedCapacity: Int, 
  initializingWith initializer: (inout UnsafeMutableBufferPointer<Element>) -> Int
)

mutating func unsafeReplaceSubrange(
  _ subrange: Range<Index>,
  withUninitializedCapacity newSubrangeCount: Int,
  initializingWith initializer: (inout UnsafeMutableBufferPointer<Element>) -> Int
)

Basically, it inserts some uninitialised capacity at the given place, and gives you a closure to initialise it out-of-order, in a similar fashion to Array.init(unsafeUninitializedCapacity: initializingWith:).

(The Int return value is supposed to be an independently-calculated version of how many elements the closure actually wrote. The implementation traps if you fail to initialise the entire inserted capacity).
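
For readers who want to experiment, here is one way the append variant could be approximated on Array using only public API. This is a sketch, not the implementation described above: it rebuilds the array (copying the existing elements) instead of growing it in place, and only the signature is taken from the post.

```swift
extension Array {
  mutating func unsafeAppend(
    uninitializedCapacity: Int,
    initializingWith initializer: (inout UnsafeMutableBufferPointer<Element>) -> Int
  ) {
    let old = self
    let oldCount = old.count
    self = Array(unsafeUninitializedCapacity: oldCount + uninitializedCapacity) {
      buffer, initializedCount in
      // Copy the existing elements, then expose only the new tail.
      _ = buffer.initialize(from: old)
      var tail = UnsafeMutableBufferPointer(rebasing: buffer[oldCount...])
      let written = initializer(&tail)
      precondition(written == uninitializedCapacity,
                   "closure must initialize exactly the requested capacity")
      initializedCount = oldCount + uninitializedCapacity
    }
  }
}

var path = Array("/usr".utf8)
path.unsafeAppend(uninitializedCapacity: 4) { tail in
  let suffix = Array("/bin".utf8)
  // Fill the new space back to front, i.e. out of order.
  for i in suffix.indices.reversed() {
    (tail.baseAddress! + i).initialize(to: suffix[i])
  }
  return suffix.count
}
let joined = String(decoding: path, as: UTF8.self)
// joined == "/usr/bin"
```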

A worthy goal, but one that Swift cannot easily achieve. Array<UInt8> and String are frozen: their representations cannot change without breaking the ABI. There is no space to change their layout to any one that does not own its storage.

In principle other types could be implemented to provide String-like operations on top of the raw type @lorentey envisions here, but those core types are set in stone until some justification comes along to break ABI.

Luckily, String's ABI already has provisions for shared storage. You're correct about Array, though - that's unlikely to be achievable at this point (although the ABI is only a concern for Darwin platforms, and those support bridging to custom NSArray subclasses, so maybe there's a way around it).

It does? I am not aware of this...

Yep 🙂

See @allevato's prototype implementation PR. It's really just the public entrypoints that are needed (and figuring out how we can make them safe, etc - see the shared substring thread for more info about the concerns there).

Yes, I intentionally kept this out of inlinable code and orthogonal to whether the object is ObjC. String supports resilient shared and resilient foreign representations. Shared means it can give a pointer to contiguous UTF-8 as the result of a (not inlinable, read-only) function call, as opposed to masking off some biased bits for native. Shared still participates in the majority of the fast path. Foreign has no constraints, nor any extra performance guarantees. I did not bake in the assumption foreign is UTF-16 encoded, at least in the ABI (since -length is at least a function call anyways).

I have (jokingly) been referring to a common interchange format as a "deconstructed COW" or simply struct 💥🐮, so that you can do the following:

  • Array -> 💥🐮
  • String -> 💥🐮 (might copy if foreign, might allocate if small) -> String (shared)

And the 🐮-ness reflects a common agreement that the AnyObject? field is copy-on-write, that is if any holder of the owner has guaranteed uniqueness, it can do an in-place mutation. 💥🐮 allows for exclusive ownership of storage.

This could then be adopted by other types such as Data, ByteBuffer, etc., as appropriate.

So, if I understand your vision, we could then theoretically also have:

String -> 💥🐮 -> ByteBuffer (shared)
Array -> 💥🐮 -> ByteBuffer (shared)
ByteBuffer -> 💥🐮 -> ByteBuffer (shared)
ByteBuffer -> 💥🐮 -> String (shared)
(But not ByteBuffer -> 💥🐮 -> Array (shared), because Array's ABI is frozen.)

(Also, may I congratulate you on the fantastic name.)

Correct, assuming ByteBuffer supports a shared representation. There are potentially many variants (such as one that includes an inline buffer, segmented storage, just a protocol conformance, etc), but most of the advantage comes from a single interchange type.

There are lots of non-obvious but important design details here. There are potential issues with typed memory access (e.g. we should forbid Array<AnyObject> -> 💥🐮 -> Array<UInt8>).

We would want to carve off extra flags bits, such that we're not storing AnyObject? but rather something akin to a future _NativeObject type, which would be like _BridgeObject but without the Objective-C half-bit. Flag bits could also be useful for runtime consistency or safety checks (e.g. tracking memory binding status), etc.

We'd want to be careful about going from a nil owner to a shared storage type that is proclaimed memory-safe such as String. String is memory safe and can support a nil owner for literals, but we'd have to think long and hard before we allow a shared string to be constructed from an arbitrary pointer whose lifetime is unknown. If we are claiming that the result is a proper shared string, then that includes UTF-8 validity, so we'd also do a check at conversion time (and fixup). Flag bits can be used to speed this up, e.g. if the bytes came from a String originally they can skip validation, or if we otherwise have guarantees of pointer lifetimes we can have a nil owner.

Certain flags (e.g. UTF-8 validity) will have to be cleared upon arbitrary mutation, so we'd have to design the mechanisms for this.
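
One possible shape for that mechanism, purely as a sketch: keep the flags next to the bytes and route every mutation through a single entry point that clears them. `BucketFlags` and `FlaggedBytes` are hypothetical, and a real design would pack the flags into spare bits of the owner reference rather than store them separately.

```swift
struct BucketFlags: OptionSet {
  let rawValue: UInt8
  static let knownValidUTF8 = BucketFlags(rawValue: 1 << 0)
}

struct FlaggedBytes {
  private(set) var bytes: [UInt8]
  private(set) var flags: BucketFlags

  // Bytes that came from a String are valid UTF-8 by construction.
  init(_ string: String) {
    bytes = Array(string.utf8)
    flags = [.knownValidUTF8]
  }

  init(unvalidated bytes: [UInt8]) {
    self.bytes = bytes
    flags = []
  }

  var needsValidation: Bool { !flags.contains(.knownValidUTF8) }

  // The only mutation path: arbitrary edits invalidate the flag.
  mutating func withMutableBytes(_ body: (inout [UInt8]) -> Void) {
    flags.remove(.knownValidUTF8)
    body(&bytes)
  }
}

var greeting = FlaggedBytes("hi")
let validBefore = !greeting.needsValidation     // true
greeting.withMutableBytes { $0.append(0xFF) }
let validAfter = !greeting.needsValidation      // false
```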