Idea: Bytes Literal

jrose · February 6, 2021, 2:08am

The main use case I have for this is switches, but I guess I could match a byte string literal against a prefix instead of a single byte.
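A prefix match on raw bytes can already be approximated today with `starts(with:)` and a string's `utf8` view. The sketch below only illustrates that idea; `method(of:)` and the request text are hypothetical and do not come from the thread.

```swift
// Hypothetical helper: classify a request by its leading bytes.
func method(of request: [UInt8]) -> String {
    // `utf8` is a sequence of UInt8, so it can be compared with a byte array directly.
    if request.starts(with: "GET ".utf8) { return "GET" }
    if request.starts(with: "POST ".utf8) { return "POST" }
    return "unknown"
}

let request = Array("GET /index.html HTTP/1.1".utf8)
let detected = method(of: request)
```

A bytes literal would let the prefixes be written without going through `String` at all, which is the gap the pitch is about.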

jrose · February 6, 2021, 2:09am

One interesting aspect of reusing existing string literal syntax is that it sidesteps this question quite nicely. It's still weird that there's no safe bytes type in the standard library, sure, but you won't get an unsafe or suboptimal type by accident.

benrimmington · February 7, 2021, 12:12pm

Instead of adding new protocols (for byte strings and interpolation), could we simply add another initializer to the existing _ExpressibleByBuiltinStringLiteral protocol?

The new initializer would only be used when a string literal contains \x{...}.
A default implementation (if needed) would forward to the existing initializer.
String would replace invalid UTF-8 bytes with U+FFFD.
StaticString might allow invalid UTF-8 bytes, by updating preconditions, and adding APIs:
- e.g. public var isUTF8: Bool (using a spare bit in _flags).
- e.g. public var bytes: UnsafeRawBufferPointer (can be empty).
Byte string interpolation would use the existing protocols, with their StringLiteralType: _ExpressibleByBuiltinStringLiteral associated types.
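For reference, the U+FFFD replacement described above matches what `String(decoding:as:)` already does with invalid input, and `StaticString` already exposes its bytes through `withUTF8Buffer`. The following is a small sketch using only existing standard library API; the proposed `isUTF8` and `bytes` members are not part of it.

```swift
// "A", one invalid byte, "B".
let input: [UInt8] = [0x41, 0xFF, 0x42]

// Decoding repairs invalid UTF-8 by substituting U+FFFD.
let repaired = String(decoding: input, as: UTF8.self)

// If the repaired bytes differ from the input, the input was not valid UTF-8.
let wasValidUTF8 = repaired.utf8.elementsEqual(input)

// StaticString already offers read access to its UTF-8 code units.
let literal: StaticString = "abc"
let literalBytes = literal.withUTF8Buffer { Array($0) }
```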

Off-topic

Could the previous Unicode scalar literals pitch be solved by having a shorter type name?

public typealias Rune = Unicode.Scalar
extension StringProtocol {
  public typealias RuneView = Self.UnicodeScalarView
  public var runes: RuneView { self.unicodeScalars }
}

let rune = Rune("\u{1F600}")
rune.properties.isEmoji  //-> true
rune.properties.name     //-> "GRINNING FACE"
rune.utf8.count          //-> 4

let runes = "ABC".runes
runes.allSatisfy(\.isASCII)  //-> true
runes.count                  //-> 3

lukasa · February 8, 2021, 10:46am

This approach seems reasonable to me; I can see how it would be made to work. I'd be interested in work in this direction.

Relatedly, UInt8.init(ascii:) happens to be a shortcut for this. Under optimisation it compiles down to the UInt8 value of a single ASCII character. Of course, this behaviour is non-obvious because Swift lacks clear constexpr semantics, and it'd be nice to have something that more clearly performs this operation.
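As an illustration of the switch use case, `UInt8(ascii:)` can appear directly in case patterns. This is a minimal sketch; `classify(_:)` and its categories are made up for the example.

```swift
// Hypothetical classifier that switches over a single byte.
func classify(_ byte: UInt8) -> String {
    switch byte {
    case UInt8(ascii: "0")...UInt8(ascii: "9"):
        return "digit"
    case UInt8(ascii: "a")...UInt8(ascii: "z"),
         UInt8(ascii: "A")...UInt8(ascii: "Z"):
        return "letter"
    case UInt8(ascii: " "), UInt8(ascii: "\t"), UInt8(ascii: "\n"):
        return "whitespace"
    default:
        return "other"
    }
}

let kinds = Array("a1 ".utf8).map(classify)
```

Each pattern is an ordinary expression, so nothing in the language guarantees constant folding here; that is the non-obvious part mentioned above.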

Karl · February 8, 2021, 5:21pm

I think there are kind of 2 discussions happening here - one about data literals (which presumably would look/work something like StaticString, support @compilerEvaluable when that's a thing, etc), and another about our general handling of owned data.

WRT the first discussion: it's good. We should do it.

WRT the second discussion:

lorentey:

I agree! However, the bucket-o-bytes type I have in mind is only intended to go the other way -- it would be able to work with whatever storage representation you have, but there is no expectation that other types would directly adopt it as a storage representation.
struct ByteBucket {
  let owner: AnyObject?
  let buffer: UnsafeRawBufferPointer
}
Note how this is just an UnsafeRawBufferPointer that maintains an owning reference to its storage.
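To make the ownership relationship concrete, here is a rough sketch of how such a value might be produced. `HeapBytes` and `makeBucket(_:)` are hypothetical names introduced only for this example; the struct is repeated so the snippet is self-contained.

```swift
// Hypothetical owner: allocates raw memory and frees it when the last reference goes away.
final class HeapBytes {
    let storage: UnsafeMutableRawBufferPointer

    init(copying bytes: [UInt8]) {
        storage = UnsafeMutableRawBufferPointer.allocate(byteCount: bytes.count, alignment: 1)
        storage.copyBytes(from: bytes)
    }

    deinit { storage.deallocate() }
}

struct ByteBucket {
    let owner: AnyObject?
    let buffer: UnsafeRawBufferPointer
}

// The bucket keeps `owner` alive, so `buffer` stays valid for as long as the bucket exists.
func makeBucket(_ bytes: [UInt8]) -> ByteBucket {
    let owner = HeapBytes(copying: bytes)
    return ByteBucket(owner: owner, buffer: UnsafeRawBufferPointer(owner.storage))
}

let bucket = makeBucket([1, 2, 3])
let sum = bucket.buffer.reduce(0) { $0 + Int($1) }
```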

It's interesting - I have quite a pressing need for something like this, so I'd be happy to help in any way I can. Similar issues came up in the shared string thread, and in discussions I've had with others about how to drive that feature forward. I also kind of touched on that topic when Array.init(unsafeUninitializedCapacity:, initializingWith:) was proposed, and again when the string version was proposed -- so it's something that has irritated me a bit over the years.

IMO, I should be able to go from ManagedBuffer<Header, UInt8> -> Array<UInt8> -> String without copying. At least for read-only use cases.

A simple (pointer, owner) pair object would be really great as a starting point, to represent "some pointer kept alive by an ARCed object". I wonder if the owner object could be anything more than a dumb ARCed thing, and maybe conform to protocols which we could as? cast to enable more functionality, such as checking for uniqueness or setting the count of initialized elements.
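The `as?` idea could look roughly like the following. The protocol and both classes are hypothetical; the point is only that a plain `AnyObject?` owner can opt in to extra capabilities that callers discover at run time.

```swift
// Hypothetical capability an owner may choose to expose.
protocol InitializedCountSettable: AnyObject {
    var initializedCount: Int { get set }
}

// An owner that supports the capability.
final class CountedStorage: InitializedCountSettable {
    var initializedCount = 0
}

// A "dumb" owner that only keeps memory alive.
final class OpaqueStorage {}

// Returns true when the owner supports setting the count, false otherwise.
func setInitializedCount(_ count: Int, on owner: AnyObject?) -> Bool {
    guard let settable = owner as? InitializedCountSettable else { return false }
    settable.initializedCount = count
    return true
}

let counted = CountedStorage()
let supported = setInitializedCount(3, on: counted)
let unsupported = setInitializedCount(3, on: OpaqueStorage())
```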

Most use-cases I have would require the buffer to be typed, though. For shared strings, I think we'd probably want to use a buffer of UInt8s rather than a raw buffer (UInt8 of course being the UTF-8 code unit type). Unfortunately this would lead to another RawByteBucket/ByteBucket<T> split.

lukasa:

We have a common currency type: UnsafeMutableRawBufferPointer. As a sketch, we could imagine the protocol being this:
protocol AppendableRawBuffer: ContiguousBytes {
    mutating func withUnsafeUninitializedTrailingBytes<ReturnType>(minimumCapacity: Int, _ block: (UnsafeMutableRawBufferPointer, inout Int) throws -> ReturnType) rethrows -> ReturnType

    var count: Int { get }
}
This is probably the minimal viable feature set required to serialise data into a buffer. It's a bit of a pain to perform some operations (e.g. back-to-front serialization), but a basic forward-moving serialisation can be cheaply performed using this abstraction.
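A toy conformance may help show how the closure and the `inout Int` interact. Everything below is an assumption made for illustration: the protocol is redeclared locally so the snippet compiles on its own, `SimpleBuffer` is hypothetical, and it stages writes in a scratch allocation and copies them, which a real implementation would avoid by exposing its own spare capacity.

```swift
import Foundation

// Local redeclaration of the sketched protocol (ContiguousBytes comes from Foundation).
protocol AppendableRawBuffer: ContiguousBytes {
    mutating func withUnsafeUninitializedTrailingBytes<ReturnType>(
        minimumCapacity: Int,
        _ block: (UnsafeMutableRawBufferPointer, inout Int) throws -> ReturnType
    ) rethrows -> ReturnType

    var count: Int { get }
}

// Hypothetical conformer backed by an array of bytes.
struct SimpleBuffer: AppendableRawBuffer {
    private(set) var bytes: [UInt8] = []

    var count: Int { bytes.count }

    func withUnsafeBytes<R>(_ body: (UnsafeRawBufferPointer) throws -> R) rethrows -> R {
        try bytes.withUnsafeBytes(body)
    }

    mutating func withUnsafeUninitializedTrailingBytes<ReturnType>(
        minimumCapacity: Int,
        _ block: (UnsafeMutableRawBufferPointer, inout Int) throws -> ReturnType
    ) rethrows -> ReturnType {
        precondition(minimumCapacity >= 0, "capacity must not be negative")
        // Simplification: write into scratch memory, then copy what was written.
        let scratch = UnsafeMutableRawBufferPointer.allocate(byteCount: minimumCapacity, alignment: 1)
        defer { scratch.deallocate() }
        var written = 0
        let result = try block(scratch, &written)
        precondition(written >= 0 && written <= minimumCapacity, "reported more bytes than were offered")
        bytes.append(contentsOf: UnsafeRawBufferPointer(rebasing: scratch[..<written]))
        return result
    }
}

var buffer = SimpleBuffer()
buffer.withUnsafeUninitializedTrailingBytes(minimumCapacity: 4) { raw, written in
    raw[0] = 0xDE
    raw[1] = 0xAD
    written = 2   // only two of the four offered bytes were used
}
```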

We could definitely use better APIs for inserting contiguous data into contiguous collections! I had a need for something like this recently (for a similar reason; I'm simplifying a path string and serialising the result in a generic container, starting at the last path component and working towards the front), so I've been using the following family of functions. They've been incredibly useful.

/// Appends space for the given number of objects, but leaves the initialization of that space to the given closure.
///
/// - important: The closure must initialize **exactly** `uninitializedCapacity` elements.
///
mutating func unsafeAppend(
  uninitializedCapacity: Int, 
  initializingWith initializer: (inout UnsafeMutableBufferPointer<Element>) -> Int
)

mutating func unsafeReplaceSubrange(
  _ subrange: Range<Index>,
  withUninitializedCapacity newSubrangeCount: Int,
  initializingWith initializer: (inout UnsafeMutableBufferPointer<Element>) -> Int
)

Basically, it inserts some uninitialised capacity at the given place, and gives you a closure to initialise it out-of-order, in a similar fashion to Array.init(unsafeUninitializedCapacity: initializingWith:).

(The Int return value is supposed to be an independently calculated count of how many elements the closure actually wrote. The implementation traps if you fail to initialise the entire inserted capacity.)
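The post does not include an implementation, so the following is only one possible sketch of `unsafeAppend` for `Array`, built on `Array.init(unsafeUninitializedCapacity:initializingWith:)`. It assumes that reallocating and copying the existing elements on every call is acceptable; a production version would presumably reuse spare capacity instead.

```swift
extension Array {
    // Sketch only: rebuilds the array so the closure can fill the new tail in any order.
    mutating func unsafeAppend(
        uninitializedCapacity: Int,
        initializingWith initializer: (inout UnsafeMutableBufferPointer<Element>) -> Int
    ) {
        precondition(uninitializedCapacity >= 0, "capacity must not be negative")
        let existing = self
        let total = existing.count + uninitializedCapacity
        self = Array(unsafeUninitializedCapacity: total) { buffer, initializedCount in
            // Copy the existing elements into the front of the new storage.
            let (_, end) = buffer.initialize(from: existing)
            initializedCount = end
            // Hand the still-uninitialized tail to the caller.
            var tail = UnsafeMutableBufferPointer(rebasing: buffer[end..<total])
            let written = initializer(&tail)
            // Trap if the closure did not fill the entire inserted capacity.
            precondition(written == uninitializedCapacity, "closure must initialize every element")
            initializedCount = total
        }
    }
}

var bytes: [UInt8] = [1, 2]
bytes.unsafeAppend(uninitializedCapacity: 3) { tail in
    // Fill the new space back to front.
    for i in stride(from: tail.count - 1, through: 0, by: -1) {
        (tail.baseAddress! + i).initialize(to: UInt8(10 + i))
    }
    return tail.count
}
```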

lukasa · February 8, 2021, 5:29pm

A worthy goal, but one that Swift cannot easily achieve. Array<UInt8> and String are frozen: their representations cannot change without breaking the ABI. There is no room to change their layout to one that does not own its storage.

In principle other types could be implemented to provide String-like operations on top of the raw type @lorentey envisions here, but those core types are set in stone until some justification comes along to break ABI.

Karl · February 8, 2021, 5:37pm

Luckily, String's ABI already has provisions for shared storage. You're correct about Array, though - that's unlikely to be achievable at this point (although the ABI is only a concern for Darwin platforms, and those support bridging to custom NSArray subclasses, so maybe there's a way around it).

xwu · February 8, 2021, 5:39pm

It does? I am not aware of this...

Karl · February 8, 2021, 5:43pm

Yep

See @allevato's prototype implementation PR. It's really just the public entrypoints that are needed (and figuring out how we can make them safe, etc - see the shared substring thread for more info about the concerns there).

Michael_Ilseman · February 8, 2021, 11:55pm

Yes, I intentionally kept this out of inlinable code and orthogonal to whether the object is ObjC. String supports resilient shared and resilient foreign representations. Shared means it can give a pointer to contiguous UTF-8 as the result of a (not inlinable, read-only) function call, as opposed to masking off some biased bits for native. Shared still participates in the majority of the fast path. Foreign has no constraints, nor any extra performance guarantees. I did not bake in the assumption foreign is UTF-16 encoded, at least in the ABI (since -length is at least a function call anyways).
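The representations described here are internal, but their effect is partly visible through public API. The sketch below uses only existing standard library members (`isContiguousUTF8`, `makeContiguousUTF8()`, and `withContiguousStorageIfAvailable` on the `utf8` view); it does not construct a shared or foreign string, since there is no public entry point for that.

```swift
var text = "h\u{E9}llo"   // "héllo": five characters, six UTF-8 code units

// Native Swift strings already report contiguous UTF-8; a lazily bridged
// (foreign) string might not, in which case this call copies it into native storage.
text.makeContiguousUTF8()
let isContiguous = text.isContiguousUTF8

// The closure only runs when contiguous UTF-8 is available; otherwise the result is nil.
let byteCount = text.utf8.withContiguousStorageIfAvailable { $0.count }
```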

github.com

apple/swift/blob/2d2a810e66571b9f72fd782eb96e783ec966429e/stdlib/public/core/StringObject.swift#L306

extension _StringObject.Nibbles {
  // The canonical empty string is an empty small string
  @inlinable @inline(__always)
  internal static var emptyString: UInt64 {
    return _StringObject.Nibbles.small(isASCII: true)
  }
}

/*

 Large strings can either be "native", "shared", or "foreign".

 Native strings have tail-allocated storage, which begins at an offset of
 `nativeBias` from the storage object's address. String literals, which reside
 in the constant section, are encoded as their start address minus `nativeBias`,
 unifying code paths for both literals ("immortal native") and native strings.
 Native Strings are always managed by the Swift runtime.

 Shared strings do not have tail-allocated storage, but can provide access
 upon query to contiguous UTF-8 code units. Lazily-bridged NSStrings capable of
 providing access to contiguous ASCII/UTF-8 set the ObjC bit. Accessing shared
I have (jokingly) been referring to a common interchange format as a "deconstructed COW" or simply struct 💥🐮 , so that you can do the following:

Array -> 💥🐮
String -> 💥🐮 (might copy if foreign, might allocate if small) -> String (shared)

And the 🐮-ness reflects a common agreement that the AnyObject? field is copy-on-write, that is, if any holder of the owner has guaranteed uniqueness, it can do an in-place mutation. 💥🐮 allows for exclusive ownership of storage.

This could then be adopted by other types such as Data, ByteBuffer, etc., as appropriate.
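The copy-on-write agreement can be sketched with `isKnownUniquelyReferenced`. This is an assumption-heavy illustration: `ByteStorage` and `CowBytes` are hypothetical, and the owner is a concrete class rather than the `AnyObject?` (or future `_NativeObject`-like) field discussed in the thread.

```swift
// Hypothetical owner of a heap allocation.
final class ByteStorage {
    let bytes: UnsafeMutableRawBufferPointer

    init(copying source: UnsafeRawBufferPointer) {
        bytes = .allocate(byteCount: source.count, alignment: 1)
        bytes.copyMemory(from: source)
    }

    deinit { bytes.deallocate() }
}

// Hypothetical value type that shares storage until someone mutates it.
struct CowBytes {
    private var owner: ByteStorage

    init(_ initial: [UInt8]) {
        owner = initial.withUnsafeBytes { ByteStorage(copying: $0) }
    }

    subscript(index: Int) -> UInt8 { owner.bytes[index] }

    mutating func set(_ value: UInt8, at index: Int) {
        // Only a unique holder may mutate in place; everyone else copies first.
        if !isKnownUniquelyReferenced(&owner) {
            owner = ByteStorage(copying: UnsafeRawBufferPointer(owner.bytes))
        }
        owner.bytes[index] = value
    }
}

let original = CowBytes([1, 2, 3])
var copy = original       // shares storage with `original`
copy.set(9, at: 0)        // storage is shared, so this copies before writing
```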

xwu · February 9, 2021, 12:03am

So, if I understand your vision, we could then theoretically also have:

String -> 💥🐮 -> ByteBuffer (shared)
Array -> 💥🐮 -> ByteBuffer (shared)
ByteBuffer -> 💥🐮 -> ByteBuffer (shared)
ByteBuffer -> 💥🐮 -> String (shared)
(But not ByteBuffer -> 💥🐮 -> Array (shared), because Array's ABI is frozen.)

(Also, may I congratulate you on the fantastic name.)

Michael_Ilseman · February 9, 2021, 12:46am

Correct, assuming ByteBuffer supports a shared representation. There are potentially many variants (such as one that includes an inline buffer, segmented storage, just a protocol conformance, etc), but most of the advantage comes from a single interchange type.

There are lots of non-obvious but important design details here. There are potential issues with typed memory access (e.g. we should forbid Array<AnyObject> -> 💥🐮 -> Array<UInt8>).

We would want to carve off extra flags bits, such that we're not storing AnyObject? but rather something akin to a future _NativeObject type, which would be like _BridgeObject but without the Objective-C half-bit. Flag bits could also be useful for runtime consistency or safety checks (e.g. tracking memory binding status), etc.

We'd want to be careful about going from a nil owner to a shared storage type that is proclaimed memory-safe such as String. String is memory safe and can support a nil owner for literals, but we'd have to think long and hard before we allow a shared string to be constructed from an arbitrary pointer whose lifetime is unknown. If we are claiming that the result is a proper shared string, then that includes UTF-8 validity, so we'd also do a check at conversion time (and fixup). Flag bits can be used to speed this up, e.g. if the bytes came from a String originally they can skip validation, or if we otherwise have guarantees of pointer lifetimes we can have a nil owner.

Certain flags (e.g. UTF-8 validity) will have to be cleared upon arbitrary mutation, so we'd have to design the mechanisms for this.
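One way to picture the flag bookkeeping is a wrapper that remembers whether its bytes are known to be valid UTF-8 and forgets that fact whenever arbitrary mutation is allowed. `FlaggedBytes` is hypothetical, and the flag here is informational only: public `String` initializers always validate, so this sketch cannot actually skip the check.

```swift
// Hypothetical byte container with a "known valid UTF-8" flag.
struct FlaggedBytes {
    private(set) var bytes: [UInt8]
    private(set) var isKnownValidUTF8: Bool

    // Bytes taken from a String are valid UTF-8 by construction.
    init(_ string: String) {
        bytes = Array(string.utf8)
        isKnownValidUTF8 = true
    }

    // Bytes of unknown origin start out unvalidated.
    init(unvalidated bytes: [UInt8]) {
        self.bytes = bytes
        isKnownValidUTF8 = false
    }

    // Arbitrary mutation clears the flag before handing out write access.
    mutating func withMutableBytes(_ body: (inout [UInt8]) -> Void) {
        isKnownValidUTF8 = false
        body(&bytes)
    }

    // Validates and repairs regardless of the flag; a real shared-string
    // conversion could consult the flag to skip this work.
    func makeString() -> String {
        String(decoding: bytes, as: UTF8.self)
    }
}

var flagged = FlaggedBytes("hi")
let validBefore = flagged.isKnownValidUTF8
flagged.withMutableBytes { $0[0] = 0xFF }
let validAfter = flagged.isKnownValidUTF8
```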