String ContiguousBytes Conformance

rlovelett · February 18, 2020, 5:53pm

I have a function whose signature is func foo<B: ContiguousBytes>(bytes: B). I'd like to be able to take a String and pass it to foo. This means that String must conform to ContiguousBytes.

I've come up with an extension to String to add conformance to ContiguousBytes.

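import Foundation // ContiguousBytes is declared in Foundation

extension String: ContiguousBytes {
    public func withUnsafeBytes<R>(_ body: (UnsafeRawBufferPointer) throws -> R) rethrows -> R {
        // withUTF8 is mutating (it may first move the string into contiguous
        // native storage), so it has to be called on a local mutable copy.
        var copy = self
        return try copy.withUTF8 { try body(UnsafeRawBufferPointer($0)) }
    }
}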

In my limited tests this seems to work well enough.

I am interested in this community's suggestions on whether there are better solutions. Additionally, are there any hidden problems with this solution I should be aware of?

Karl · February 18, 2020, 7:09pm

ContiguousBytes is a Foundation protocol (interestingly enough, a similar protocol was rejected from the standard library in favour of withContiguousStorageIfAvailable, and the standard library has its own, internal notion of something that consists of contiguous bytes, which is only ever used by a couple of String initialisers). So it’s kind of a mess right now.

Since CB lives in Foundation, it’s unfortunately outside of swift-evolution’s scope.

Also, I’m not sure that the conformance should live on String directly; it may make more sense for it to live on the UTF8 view (which also implements withContiguousStorageIfAvailable)
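For reference, a minimal sketch of what the UTF8 view already offers through withContiguousStorageIfAvailable. The variable names are only illustrative, and the call can return nil when the view has no contiguous storage to expose:

```swift
let greeting = "Hello, playground"

// The closure only runs if the UTF-8 view can expose contiguous storage;
// otherwise the call returns nil and the closure is never invoked.
let byteCount: Int? = greeting.utf8.withContiguousStorageIfAvailable {
    (buffer: UnsafeBufferPointer<UInt8>) -> Int in
    buffer.count
}

if let byteCount = byteCount {
    print("contiguous UTF-8 storage, \(byteCount) bytes")
} else {
    // Fallback: copy the bytes out, e.g. Array(greeting.utf8).
    print("no contiguous storage available")
}
```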

rlovelett · February 18, 2020, 8:54pm

Sorry @Karl I probably was not being clear. I'm not trying to evolve the language. Just for my own use.

Mostly I'm asking whether there is a better way to get String conforming to ContiguousBytes than what I presented, or what I should be careful of in the implementation I presented.

Jens · February 18, 2020, 9:17pm

There's a rule of thumb that says you should not conform types you do not own to protocols you do not own.
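One way to follow that rule while still handing a string to a ContiguousBytes parameter is to wrap it in a type you do own. This is only a sketch; the name UTF8Bytes is hypothetical, and the body reuses the withUTF8 approach from the original post:

```swift
import Foundation

// Hypothetical wrapper type: we own it, so conforming it to
// ContiguousBytes cannot clash with a future Foundation conformance.
struct UTF8Bytes: ContiguousBytes {
    var string: String

    func withUnsafeBytes<R>(_ body: (UnsafeRawBufferPointer) throws -> R) rethrows -> R {
        // withUTF8 is mutating, so work on a local copy of the string.
        var copy = string
        return try copy.withUTF8 { try body(UnsafeRawBufferPointer($0)) }
    }
}

// Copy the bytes out inside the closure; the pointer must not escape it.
let wrappedBytes: [UInt8] = UTF8Bytes(string: "abc").withUnsafeBytes { Array($0) }
print(wrappedBytes)
```

A call site would then look like foo(bytes: UTF8Bytes(string: yourString)).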

Is there a reason why you must pass exactly a String to foo? Because the following seems simpler, no need for implementing any protocol etc, just this:

if let utf8Data = yourString.data(using: .utf8) {
    foo(bytes: utf8Data)
} else {
    // ...
}

Or if you know that yourString will always be representable as utf8:

foo(bytes: yourString.data(using: .utf8)!)

?

rlovelett · February 18, 2020, 9:46pm

I tried that one too. I have never personally liked data(using:). I dislike the Data?. I do not understand what that Data? is and therefore I fear it.

I will admit my ignorance: it is never really clear to me (and I don't know how to independently verify it) whether that makes a copy of the String (i.e., is it essentially a malloc and memcpy_s)? I assume it does. Assuming the copy, this seems like a weird way to get a contiguous set of bytes for my String. I have to make a copy, a copy that might fail, of a thing I already have.

Even in the case that it is not a memcpy_s it still introduces the else condition I have to deal with or a force unwrap.

Without adding conformance to ContiguousBytes you could achieve the same thing by doing:

var str = "Hello, playground"
str.withUTF8 { foo(bytes: $0) }

It seems that this one does it without the copy and without the possible error condition. But I cannot shake the feeling that I'm deluding myself.

bjhomer · February 18, 2020, 9:46pm

Is there any case where a Swift String is not representable as UTF8?

bzamayo · February 18, 2020, 9:51pm

A more expressive form that doesn't have the force-unwrap ugliness is Data(str.utf8)
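A small side-by-side sketch of the two spellings, with illustrative variable names. Both produce a Data value; only data(using:) is typed as Optional:

```swift
import Foundation

let str = "Hello, playground"

// data(using:) returns Data? and has to be unwrapped.
let viaEncoding: Data? = str.data(using: .utf8)

// Data(str.utf8) goes through Data's generic Sequence initializer
// and returns a non-optional Data.
let viaUTF8View: Data = Data(str.utf8)

print(viaEncoding == viaUTF8View)
print(viaUTF8View.count)
```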

rlovelett · February 18, 2020, 9:58pm

I like the alternative syntax as a nice way to side-step the force-unwrap.

Am I right to assume that this creates a new copy of the memory behind the String in a new Data? Not a deal breaker if it is. Just trying to gain understanding.

Lantua · February 18, 2020, 10:02pm

Semantically it would. Since it is using the generic init and there's no guarantee that the underlying data is contiguous, Data needs to copy the data out.

With that said, I would benchmark it to see if the compiler can optimize through that.

Found this

Karl · February 18, 2020, 10:29pm

I see. As @Jens pointed out, it’s unwise to conform a type you don’t own to a protocol you don’t own. Foundation might decide to introduce its own conformance one day, and in that case, bad things might happen. So please don’t ship this in a public library.

To answer your question: yes, this is a perfectly fine way to conform to CB, and is exactly what the standard library does for its internal version. You don’t need to copy in to a Data or Array.

Personally, I would do it on UTF8View, but that’s just me.

xwu · February 19, 2020, 12:49am

To be clear, Foundation won't ever conform String to ContiguousBytes because semantically it can't.

The protocol requires conforming types to have an underlying contiguous collection of raw bytes, and (as the documentation for withUTF8 makes clear) String does not always have such a contiguous collection of raw bytes as backing storage. You can study the difference between the standard library's _HasContiguousBytes and Foundation's ContiguousBytes to see why String conforms to one but not the other.
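The standard library exposes this distinction directly on String. A minimal sketch using isContiguousUTF8 and makeContiguousUTF8(); natively created strings normally report true already, so the interesting case is a lazily bridged string, which is not reproduced here:

```swift
var text = "Hello, playground"

// Reports whether the string is currently backed by contiguous UTF-8.
print(text.isContiguousUTF8)

// Does nothing if the storage is already contiguous UTF-8; otherwise it
// copies the contents into native contiguous storage.
text.makeContiguousUTF8()

// withUTF8 is mutating for the same reason: it may need to perform
// that copy before it can hand out a buffer pointer.
let utf8Count = text.withUTF8 { $0.count }
print(utf8Count)
```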

Karl · February 19, 2020, 12:57am

You could say the same for Array. It’s a direct parallel because the only time Array or String could be non-contiguous is when they are wrapping a bridged Obj-C type (which can be subclassed and implemented however you like).

EDIT: oh, apparently Array’s conformance sidesteps this by making it conditional on the element being a scalar. While there are well-known tricks to insert scalars into Obj-C/CF collections, it doesn’t appear to be officially supported according to the documentation. So Foundation manages to tiptoe around that issue.

xwu · February 19, 2020, 3:05am

Right, precisely!

rlovelett · February 20, 2020, 8:07pm

I wanted to follow up and say thank you all for the discussion. There were more than a few things I learned here and I appreciate it.

In the end I think this is going to be the implementation I go with.

func foo<S: Sequence>(bytes: S) where S.Element == UInt8 {
  // use withContiguousStorageIfAvailable
}

let buffer = UnsafeMutableBufferPointer<UInt8>.allocate(capacity: 42)
buffer.initialize(repeating: 0) // allocate(capacity:) returns uninitialized memory
defer { buffer.deallocate() }

let str = "Hello, playground"
foo(bytes: str.utf8)
foo(bytes: buffer)

I think there are a number of wins here, but the notable ones to me are:

No dependency on Foundation for ContiguousBytes. In turn, no temptation to add conformance to protocols/types I do not control.
The Sequence protocol is widely conformed in the Swift standard library. So now foo works on String (by way of UTF8View) or even something much lower like UnsafeMutableBufferPointer.
According to the docs there is no unnecessary memory copy. Obviously this is not the same as no memory copy. But that is a trade off I personally like.
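To make the "use withContiguousStorageIfAvailable" comment concrete, here is one possible shape for such a function. It is a sketch, not the real body of foo: byteSum(of:) is a hypothetical stand-in that just adds up the bytes, with a fast path for contiguous storage and a plain iteration fallback:

```swift
// Hypothetical stand-in for foo: sums the bytes of any UInt8 sequence.
func byteSum<S: Sequence>(of bytes: S) -> Int where S.Element == UInt8 {
    // Fast path: the sequence can expose its elements as one buffer.
    let fast: Int? = bytes.withContiguousStorageIfAvailable {
        (buffer: UnsafeBufferPointer<UInt8>) -> Int in
        buffer.reduce(0) { $0 + Int($1) }
    }
    if let sum = fast {
        return sum
    }
    // Slow path: no contiguous storage, so iterate element by element.
    return bytes.reduce(0) { $0 + Int($1) }
}

let fromString = byteSum(of: "abc".utf8)
let fromArray = byteSum(of: [1, 2, 3] as [UInt8])
// StrideThrough has no contiguous storage, so this takes the slow path.
let fromStride = byteSum(of: stride(from: UInt8(1), through: 3, by: 1))
print(fromString, fromArray, fromStride)
```

Keeping the fallback means the function still accepts sequences that cannot expose contiguous storage, at the cost of a per-element loop.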

As I said, thanks for the discussion; I really appreciated it.