Idea: Bytes Literal

Michael_Ilseman · February 8, 2021, 11:55pm

Yes, I intentionally kept this out of inlinable code and orthogonal to whether the object is ObjC. String supports resilient shared and resilient foreign representations. Shared means it can give a pointer to contiguous UTF-8 as the result of a (not inlinable, read-only) function call, as opposed to masking off some biased bits for native. Shared still participates in the majority of the fast path. Foreign has no constraints, nor any extra performance guarantees. I did not bake in the assumption that foreign is UTF-16 encoded, at least in the ABI (since -length is at least a function call anyway).

https://github.com/apple/swift/blob/2d2a810e66571b9f72fd782eb96e783ec966429e/stdlib/public/core/StringObject.swift#L306


      
          extension _StringObject.Nibbles {
            // The canonical empty string is an empty small string
            @inlinable @inline(__always)
            internal static var emptyString: UInt64 {
              return _StringObject.Nibbles.small(isASCII: true)
            }
          }
          
          /*
          
           Large strings can either be "native", "shared", or "foreign".
          
           Native strings have tail-allocated storage, which begins at an offset of
           `nativeBias` from the storage object's address. String literals, which reside
           in the constant section, are encoded as their start address minus `nativeBias`,
           unifying code paths for both literals ("immortal native") and native strings.
           Native Strings are always managed by the Swift runtime.
          
           Shared strings do not have tail-allocated storage, but can provide access
           upon query to contiguous UTF-8 code units. Lazily-bridged NSStrings capable of
           providing access to contiguous ASCII/UTF-8 set the ObjC bit. Accessing shared

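From the outside, the closest public analogue to the native/shared fast path versus the foreign slow path is probably `withContiguousStorageIfAvailable` on the UTF-8 view. The following is only a rough sketch using public API; the function name `utf8ByteSum` is made up for illustration, and public API cannot tell native from shared, only whether contiguous UTF-8 is available.

```swift
/// Hypothetical helper: sums the UTF-8 code units of a string.
/// Takes the contiguous fast path when the string can hand out a buffer
/// of UTF-8, and falls back to element-by-element iteration otherwise
/// (for example, a lazily bridged NSString that is not contiguous UTF-8).
func utf8ByteSum(_ string: String) -> Int {
    let fast = string.utf8.withContiguousStorageIfAvailable { buffer -> Int in
        // `buffer` is an UnsafeBufferPointer<UInt8>, valid only inside this closure.
        var total = 0
        for byte in buffer { total += Int(byte) }
        return total
    }
    if let fast = fast {
        return fast
    }
    // Slow path: no contiguous UTF-8 available, so walk the view instead.
    var total = 0
    for byte in string.utf8 { total += Int(byte) }
    return total
}

let greeting = "abc"
print(greeting.isContiguousUTF8)
print(utf8ByteSum(greeting))
```

If a caller needs the fast path unconditionally, `makeContiguousUTF8()` converts the string to a native representation first, at the cost of a possible copy.
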
I have (jokingly) been referring to a common interchange format as a "deconstructed COW" or simply struct 💥🐮 , so that you can do the following:

Array -> 💥🐮
String -> 💥🐮 (might copy if foreign, might allocate if small) -> String (shared)

And the 🐮-ness reflects a common agreement that the AnyObject? field is copy-on-write: if any holder of the owner has guaranteed uniqueness, it can do an in-place mutation. 💥🐮 allows for exclusive ownership of storage.

This could then be adopted by other types such as Data, ByteBuffer, etc., as appropriate.
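
To make the shape of the idea concrete, here is a minimal sketch of what such an interchange struct might look like. Every name in it (`ByteStorage`, `DeconstructedCOW`, `setByte`) is hypothetical and nothing here is standard library API. It also assumes the conversions always copy into fresh storage, whereas the point of the real thing would be to hand over an existing buffer without copying where possible.

```swift
/// Hypothetical heap storage that frees its bytes when the last owner goes away.
final class ByteStorage {
    let buffer: UnsafeMutableRawBufferPointer

    init(copying bytes: [UInt8]) {
        buffer = .allocate(byteCount: bytes.count, alignment: 1)
        buffer.copyBytes(from: bytes)
    }

    deinit { buffer.deallocate() }
}

/// Hypothetical "deconstructed COW": an owner plus the bytes it keeps alive.
struct DeconstructedCOW {
    var owner: AnyObject?
    var bytes: UnsafeMutableRawBufferPointer

    /// Array -> struct. This sketch always copies; that is an assumption.
    init(_ array: [UInt8]) {
        let storage = ByteStorage(copying: array)
        owner = storage
        bytes = storage.buffer
    }

    /// String -> struct, via its UTF-8 code units (again, always copying here).
    init(_ string: String) {
        self.init(Array(string.utf8))
    }

    /// Writes in place if the owner is uniquely referenced, otherwise copies
    /// the bytes into new storage first. Returns true when no copy was needed.
    @discardableResult
    mutating func setByte(_ value: UInt8, at index: Int) -> Bool {
        precondition(bytes.indices.contains(index), "index out of range")
        let unique = isKnownUniquelyReferenced(&owner)
        if !unique {
            let copy = ByteStorage(copying: Array(bytes))
            owner = copy
            bytes = copy.buffer
        }
        bytes[index] = value
        return unique
    }
}

var first = DeconstructedCOW([1, 2, 3])
var second = first                 // both values now share one owner
second.setByte(9, at: 0)           // shared, so this copies before writing
first.setByte(7, at: 1)            // sole holder again, so this writes in place
print(Array(first.bytes), Array(second.bytes))
```

The uniqueness check on the `AnyObject?` field is the whole contract: whoever can prove sole ownership may mutate in place, and everyone else copies first.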