Using String with zero-copy C APIs

davidbalbert · March 6, 2024, 6:01pm

I mentioned this in Pitch: Unicode Processing APIs, but I think it's a separate enough issue to warrant a different thread. It would be good to discuss this now before all the new ownership stuff gets locked in. What follows is a lightly adapted version of what I posted before.

There's a class of C APIs that are designed to be zero copy that are impossible to use from Swift – at least using String – without doing a memcpy of the underlying data. Here's an example.

Consider this (simplified) API from Tree-sitter:

typedef struct {
  void *payload;
  const char *(*read)(void *payload, uint32_t byte_index, uint32_t *bytes_read);
} TSInput;

TSTree *ts_parser_parse(TSParser *self, TSInput input);

TSInput is a closure. When you call ts_parser_parse, your TSInput closure is called N times, giving you the opportunity to provide a (pointer, length) pair to some contiguous string data. You don't have to provide the whole string at once. Just whatever is convenient – if you were using this with BigString from Swift Collections, returning a pointer to the data stored in a single leaf of the tree each time your closure is called is the obvious choice.
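To make the chunked read contract concrete, here is a minimal pure-Swift model of it. `ChunkedText` and `read(at:)` are hypothetical names, and the sketch returns array slices rather than raw pointers, so it only illustrates how a byte index maps to "the rest of one chunk", not the pointer-lifetime problem itself.

```swift
/// Hypothetical model of text stored in several separate buffers.
struct ChunkedText {
    let chunks: [[UInt8]]

    /// Returns the bytes from `byteIndex` to the end of the chunk containing it,
    /// or an empty slice once `byteIndex` is past the end of the text.
    func read(at byteIndex: Int) -> ArraySlice<UInt8> {
        var start = 0
        for chunk in chunks {
            let end = start + chunk.count
            if byteIndex < end {
                return chunk[(byteIndex - start)...]
            }
            start = end
        }
        return []
    }
}
```

A caller would keep asking for `read(at:)` with an advancing index until it gets an empty slice back, which mirrors how the parser keeps calling `read` until no bytes are returned.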

A Swift API might look something like this:

let str = "..."
let parser = Parser()
let syntaxTree = parser.parse {
    // this does not exist
    return str.escapingBufferPointer
}
// make sure str's lifetime extends to here

While your pointers escape the TSInput closure, Tree-sitter guarantees they do not escape the call to ts_parser_parse.

There's no way to safely do this using String without manually allocating a buffer, copying the contents of the string you want to escape into that buffer, and making sure to clean it up when you're done. For a concrete example, look at how SwiftTreeSitter handles this.
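As a rough illustration of that manual workaround (not SwiftTreeSitter's actual code), the sketch below copies the UTF-8 bytes into a buffer whose address stays fixed for the duration of a closure. `withCopiedUTF8` is a hypothetical helper name.

```swift
/// Hypothetical helper: copies the UTF-8 bytes of `string` into a manually
/// allocated buffer, passes that buffer to `body`, and frees it afterwards.
/// Pointers into the buffer are stable for the whole call to `body`.
func withCopiedUTF8<R>(of string: String, _ body: (UnsafeBufferPointer<UInt8>) -> R) -> R {
    let utf8 = string.utf8
    let buffer = UnsafeMutableBufferPointer<UInt8>.allocate(capacity: utf8.count)
    defer { buffer.deallocate() }
    // This is the copy the thread is trying to avoid.
    _ = buffer.initialize(from: utf8)
    return body(UnsafeBufferPointer(buffer))
}
```

Inside `body`, a read callback could hand out pointers into the buffer freely, as long as none of them outlives the call. The cost is one full copy of the string.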

This is a more general problem than just String though, because Swift doesn't have any way to escape a pointer. AFAIK this is not fixed by [Pitch] Safe Access to Contiguous Storage. StorageView has withUnsafeBufferPointer, similar to other standard library types, but just like the existing methods, the provided pointer isn't allowed to escape from the closure that it's yielded to.

I think this is a relatively common pattern in C APIs, and it should be possible to call this class of API efficiently from Swift.

jrose · March 6, 2024, 6:11pm

I guess the API (for String and Array) would be something like “OwnedBuffer”, which would contain AnyObject (to keep the backing storage alive and immutable) and an UnsafeBufferPointer property that’s promised to be stable and alive as long as the parent is kept alive. That is indeed something the stdlib could expose…though it’d be worth thinking about if borrowing and move-only types could make it a little safer in common cases. Of course, the goal really is to pass a pointer to C, so in the end there’s always going to be a “you must keep this thing alive until you’re done with it” rule.

(ContiguousArray might even have enough guarantees to use directly today if you promise not to mutate it, but I don’t think it can share storage with String [yet]. And of course, it’s not documented as such, so it’s not a real guarantee of stability.)
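A minimal sketch of the shape being described, assuming nothing about how the stdlib would implement it: `OwnedBuffer`, `ByteStorage`, and `makeOwnedBuffer(copying:)` are all hypothetical names. Note that this version copies the bytes into its own allocation, which is exactly what a stdlib-provided version sharing String's storage would avoid.

```swift
/// Hypothetical owner object: holds an allocation and frees it on deinit.
final class ByteStorage {
    let buffer: UnsafeMutableBufferPointer<UInt8>

    init(copying string: String) {
        let utf8 = string.utf8
        buffer = .allocate(capacity: utf8.count)
        _ = buffer.initialize(from: utf8)
    }

    deinit { buffer.deallocate() }
}

/// Hypothetical "OwnedBuffer": `bytes` is valid for as long as `owner` is alive.
struct OwnedBuffer {
    let owner: AnyObject
    let bytes: UnsafeBufferPointer<UInt8>
}

func makeOwnedBuffer(copying string: String) -> OwnedBuffer {
    let storage = ByteStorage(copying: string)
    return OwnedBuffer(owner: storage, bytes: UnsafeBufferPointer(storage.buffer))
}
```

The "you must keep this thing alive" rule shows up here as a plain convention: nothing stops a caller from copying `bytes` out and dropping the `OwnedBuffer`.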

davidbalbert · March 6, 2024, 6:49pm

That all makes sense. The dream is definitely something that uses borrowing or move-only types to let you express "I need to escape a pointer that should remain valid as long as the object it's derived from remains valid," and then keep that object alive.

I wonder if Swift's closure types could be extended to help with this. Even if we had a lifetime dependent escaping pointer type, I don't think you could wrap ts_parser_parse with the sample Swift API I wrote above.

At some point, you need to create a TSInput that has a function pointer and a context pointer. If it was easy to split a Swift closure into a pair of (function pointer, context pointer), and then reconstitute the closure later – ideally with the same sort of escape analysis used with borrowing or move-only types – you might get something pretty elegant. You'd want the compiler to infer:

The String's lifetime depends on the closure's lifetime - Exists today
The closure's lifetime depends on the call to the parse method (or to the lifetime of the Parser if you want to pass the closure to the parser during initialization) - Exists today
The escaped (function, payload) are valid as long as the closure is valid - Doesn't exist
The escaped pointer to the string is valid as long as the string is valid - Doesn't exist.

I think if you had all that, you might have everything you need to make this work without having to do any manual lifetime management.

Edit: After some more thinking, I think you could also have the closure be a single lifetime-dependent pointer, pass the whole closure in as the TSInput's context, and then call the closure from within the read function that you supply to the TSInput. Regardless, the "pass a (function, context) pair to a C API" pattern is extremely common, and it would be great to have first-class support for it in Swift no matter how it happens.
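For reference, this is roughly how the (function, context) split is done by hand today. Everything here is a stand-in: `CCallback` and `callThreeTimes` play the role of a C API, and `ClosureBox` and `withCCallback` are hypothetical helpers. The lifetime of the box is managed manually, which is the part the compiler cannot currently check.

```swift
/// Stand-in for a C callback type such as `int (*)(void *context)`.
typealias CCallback = @convention(c) (UnsafeMutableRawPointer?) -> Int32

/// Stand-in for a C API that calls the (function, context) pair synchronously.
func callThreeTimes(_ function: CCallback, _ context: UnsafeMutableRawPointer?) -> Int32 {
    var total: Int32 = 0
    for _ in 0..<3 { total += function(context) }
    return total
}

/// Boxes a Swift closure so it can travel through a `void *`.
final class ClosureBox {
    let body: () -> Int32
    init(_ body: @escaping () -> Int32) { self.body = body }
}

func withCCallback(_ body: @escaping () -> Int32) -> Int32 {
    let box = ClosureBox(body)
    let context = Unmanaged.passUnretained(box).toOpaque()
    // A @convention(c) closure cannot capture anything, so it recovers the
    // Swift closure from the context pointer instead.
    let trampoline: CCallback = { raw in
        let box = Unmanaged<ClosureBox>.fromOpaque(raw!).takeUnretainedValue()
        return box.body()
    }
    // The context is unretained, so the box must be kept alive by hand.
    return withExtendedLifetime(box) {
        callThreeTimes(trampoline, context)
    }
}
```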

tera · March 6, 2024, 8:19pm

I think you can do this by nesting your parser into a withUnsafeBytes closure:

struct Parser {
    func parse(execute: () -> (Int, UnsafePointer<UInt8>)) {}
}

var s: String = ...
withUnsafeBytes(of: &s) { bufferPointer in
    let pointer = bufferPointer.baseAddress!.assumingMemoryBound(to: UInt8.self)
    let parser = Parser()
    let syntaxTree = parser.parse {
        (chunkSize, pointer + chunkOffset)
    }
}


davidbalbert · March 6, 2024, 9:19pm

Yeah I think you're right. That works because String's storage is a single contiguously allocated buffer. I hadn't thought of that, probably because when I was dealing with this I was using a custom string type like BigString.

That said, you'll run into trouble if you have a data representation that has more than one contiguous buffer. E.g. an array of Strings or a tree like BigString, which stores its contents across multiple leaves.

For BigString, you can imagine an API that yields an Array of buffer pointers to a closure, and then the solution looks pretty similar to your example, but I think that starts to get awkward. If it's expensive to gather all the pointers in the tree into an array before calling the closure, and if the API you're calling does any sort of streaming – in this example, that would mean yielding a partial parse tree for each call to read, which admittedly Tree-sitter doesn't do – responsiveness would be better if you can yield one pointer at a time.
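A sketch of what "yield an Array of buffer pointers to a closure" could look like, using `[[UInt8]]` as a stand-in for the leaves of a tree. `withAllBuffers` is a hypothetical helper; it works by nesting one `withUnsafeBufferPointer` call per chunk, so every pointer is still inside its own closure when `body` runs.

```swift
/// Hypothetical helper: pins every chunk at once by nesting one
/// `withUnsafeBufferPointer` call per chunk, then calls `body` with all pointers.
func withAllBuffers<R>(
    _ chunks: [[UInt8]],
    from index: Int = 0,
    collected: [UnsafeBufferPointer<UInt8>] = [],
    _ body: ([UnsafeBufferPointer<UInt8>]) -> R
) -> R {
    guard index < chunks.count else { return body(collected) }
    return chunks[index].withUnsafeBufferPointer { buffer in
        withAllBuffers(chunks, from: index + 1, collected: collected + [buffer], body)
    }
}
```

This shows the awkwardness: the recursion depth grows with the number of chunks, and all pointers must be gathered before any work starts, so nothing can be streamed.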

I'm also not sure how easy it would be to implement the proposed API that I wrote up. For clarity, I think it's probably better to work in terms of the actual underlying C API to see what's possible. This doesn't invalidate your solution for a single string though. Here's what I came up with:

final class BufferPointer {
    let baseAddress: UnsafePointer<UInt8>
    let count: Int

    init(baseAddress: UnsafePointer<UInt8>, count: Int) {
        self.baseAddress = baseAddress
        self.count = count
    }
}

let parser = ts_parser_new()
var s: String = ...
s.withUTF8 { bufferPointer in
    // In theory BufferPointer could be a struct, but I'm pretty sure that if it was,
    // passing &buffer to the TSInput initializer would be a no-no as the
    // pointer is only valid for the call to the initializer.
    //
    // It also might be possible to pass in bufferPointer directly and then in
    // read cast the UnsafeMutableRawPointer to an UnsafeBufferPointer<UInt8>,
    // but I'm not sure if that's supported.
    let buffer = BufferPointer(baseAddress: bufferPointer.baseAddress!, count: bufferPointer.count)

    let input = TSInput(
        payload: Unmanaged.passUnretained(buffer).toOpaque(),
        read: read
    )

    // The payload is unretained, so keep buffer alive until parsing is done.
    let syntaxTree = withExtendedLifetime(buffer) {
        ts_parser_parse(parser, input)
    }
}

func read(_ payload: UnsafeMutableRawPointer?, _ byteIndex: UInt32, _ bytesRead: UnsafeMutablePointer<UInt32>?) -> UnsafePointer<CChar>? {
    let buffer = Unmanaged<BufferPointer>.fromOpaque(payload!).takeUnretainedValue()
    let offset = Int(byteIndex)
    guard offset < buffer.count else {
        // Nothing left to read: report zero bytes.
        bytesRead!.pointee = 0
        return nil
    }
    // Hand back everything from byteIndex to the end of the string.
    bytesRead!.pointee = UInt32(buffer.count - offset)
    return UnsafeRawPointer(buffer.baseAddress + offset).assumingMemoryBound(to: CChar.self)
}

jrose · March 6, 2024, 9:23pm

Oh yeah, withCString and withUTF8 will flatten the string if necessary. But they’re only good enough if you can complete all your work synchronously inside the closure; the pointer is not guaranteed to be valid once the call ends. (This happens most obviously with small strings or when you mutate the string afterwards and it causes a reallocation.)
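A small sketch of the synchronous pattern that is safe with `withUTF8`: all work on the bytes happens inside the closure, and only a computed value (never the pointer) leaves it. `byteSum(of:)` is a hypothetical example function.

```swift
/// Sums the UTF-8 bytes of `string`, touching the buffer only inside the closure.
func byteSum(of string: String) -> Int {
    // withUTF8 is mutating because it may first rewrite the string into
    // contiguous UTF-8 storage, so it needs a mutable copy.
    var copy = string
    return copy.withUTF8 { buffer in
        buffer.reduce(0) { $0 + Int($1) }
    }
}
```

Returning `buffer` or `buffer.baseAddress` from the closure would compile, but the result would be a dangling pointer under the rule described above.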

wadetregaskis · March 6, 2024, 9:25pm

And I think that defeats the point here, since the idea is to eliminate memory copies.

Is String guaranteed to be contiguous if it's actually the native Swift string, rather than NSString? In the same way as e.g. Array for non-ObjC Elements.

jrose · March 6, 2024, 9:27pm

Small strings aren’t in an allocation at all, so while they can present contiguous storage from these APIs and from withContiguousStorageIfAvailable, the pointer won’t be stable.

tera · March 6, 2024, 9:32pm

This is true, but at this point I'd have to ask: what exactly is the parser doing to the passed string fragment? Let's assume, fairly, that it's doing some processing with the passed string fragment. One possibility would be to refactor how the parser invokes the callback:

let parser = Parser()
parser.parse { parserInput in
    // this is my callback, takes a fragment of a (big)string
    // and instead of "returning" it passes it to the `parserInput`
    let fragment: UnsafePointer<UInt8> = ...
    let fragmentSize = ...
    parserInput(fragment, fragmentSize) // the actual string processing happens here
    // fragment is not used beyond this point
}

davidbalbert · March 6, 2024, 9:51pm

So as to not lose the plot, I think small strings aren't a huge issue. The reason you'd care about avoiding the memcpy is if your strings are huge.

But I think @jrose's larger point is the right one:

There are enough use cases where Swift's closure-based pointer APIs aren't good enough.

It's clear that for a single String, even a very large one, the closure-based APIs are probably sufficient. While I haven't tested it, I think something like my adaptation of @tera's solution would probably work.

But for any other non-contiguous string storage (arrays, trees, etc.), you're in trouble. And at some point, working around the closure-based pointer APIs starts to feel non-ergonomic and a bit hacky.

There's a lot of work happening on fine grained allocation and ownership control in Swift right now. My hope in bringing this up is to make sure that whatever final version of that ships is able to handle this problem as well.

Yup, that's true. There are other ways to design the C API so that this is not an issue – an inversion of control where you push chunks onto the parser rather than having it ask you for chunks. And the answer to your question is indeed "doing some processing with the passed string fragment."

But for the purposes of this discussion I'd like to assume changing the C API isn't an option. Certainly it wasn't an option for me when using Tree-sitter, and I wouldn't be surprised if a lot of the use cases for Swift's C interop features are for interfacing with code that you don't own.

tera · March 6, 2024, 10:03pm

I'm not familiar with BigString. How are its leaves stored?
Maybe there's no problem getting a stable pointer into the middle of its data (stable until a modification, that is)?

davidbalbert · March 6, 2024, 10:14pm

It's a balanced tree. In this case, it's similar to a B-Tree (not too tall, very high branching factor, fully balanced by construction), but I think the specifics are less important than the fact that the text is stored across multiple different buffers.

In the general case, I think that's correct: you could build a version of BigString that can give you that guarantee, and that's a stronger guarantee than String can currently provide given the issues with small strings.

For the implementation that BigString has now, each of the leaves stores a String, so without changing that, it wouldn't be able to give you a stable pointer. There's a lot you gain by using String in your leaves (I've learned this in my own implementation), so this isn't unexpected.

tera · March 6, 2024, 10:35pm

A rather long-winded alternative would be to use two explicit threads and a semaphore to signal the information back and forth at strategic points. Here's a sketch implementation to give you an idea:

var _fragment: UnsafePointer<UInt8>?
var _size: Int = 0
let semaphore = DispatchSemaphore(value: 0)
let lock = NSLock()

func parserThread() {
    Thread {
        let parser = Parser()
        parser.parse {
            semaphore.signal()
            let (fragment, size) = lock.withLock {
                (_fragment, _size)
            }
            semaphore.wait()
            return (fragment, size)
        }
    }.start()
}

func readerThread() {
    Thread {
        fetch { fragment, size in
            semaphore.wait()
            lock.withLock {
                _fragment = fragment
                _size = size
            }
            semaphore.signal()
        }
    }.start()
}

func fetch(callback: (UnsafePointer<UInt8>?, Int) -> Void) {
    ...
}

struct Parser {
    func parse(execute: () -> (UnsafePointer<UInt8>?, Int)) {
        let (p, size) = execute()
        // TODO
    }
}

(could contain errors, it's just a sketch).
However, would that be faster than string (fragment) copying?
Plus it's definitely uglier.
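For what it's worth, here is one way the handoff could be arranged so that each fragment stays valid until the parser asks for the next one. It uses two semaphores, one per direction, instead of one. `Handoff` and `totalBytes(in:)` are hypothetical names, and `[[UInt8]]` stands in for the string's fragments. Treat it as a sketch too.

```swift
import Foundation

/// Hypothetical helper: lends values from a producer thread to a consumer thread,
/// one at a time. A lent value stays valid until the consumer asks for the next one.
final class Handoff<Value>: @unchecked Sendable {
    private var slot: Value?
    private let request = DispatchSemaphore(value: 0)  // consumer -> producer
    private let ready = DispatchSemaphore(value: 0)    // producer -> consumer

    /// Producer: call once before the first `lend`.
    func begin() { request.wait() }

    /// Producer: publishes `value`, then blocks until the consumer requests the
    /// next one, so `value` may point into a closure-scoped buffer.
    func lend(_ value: Value) {
        slot = value
        ready.signal()
        request.wait()
    }

    /// Producer: call once after the last `lend` to report the end of input.
    func finish() {
        slot = nil
        ready.signal()
    }

    /// Consumer: returns the next value, or nil at the end of input.
    /// The previously returned value must not be used after this call.
    func next() -> Value? {
        request.signal()
        ready.wait()
        return slot
    }
}

/// Sums the byte counts of `chunks`; each pointer is produced inside its own
/// `withUnsafeBufferPointer` closure on the producer thread.
func totalBytes(in chunks: [[UInt8]]) -> Int {
    let handoff = Handoff<UnsafeBufferPointer<UInt8>>()
    Thread {
        handoff.begin()
        for chunk in chunks {
            chunk.withUnsafeBufferPointer { handoff.lend($0) }
        }
        handoff.finish()
    }.start()

    var total = 0
    while let buffer = handoff.next() {  // plays the role of the parser's read callback
        total += buffer.count
    }
    return total
}
```

The consumer has to keep calling `next()` until it returns nil; otherwise the producer thread stays blocked inside `lend`.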

davidbalbert · March 6, 2024, 11:24pm

Ha, I hadn't thought of this. I see what you're getting at, and I think that would work.

I've often wished that Swift had synchronous, stackful coroutines for many reasons – "interior iteration," for one, where you can build a yielding iterator out of a loop, implicitly storing your iteration state on the stack instead of managing it yourself in an Iterator struct.

I think you'd probably be able to use coroutines in the same way as you're doing here, but without the overhead of threads, and without the difficulty of doing synchronization correctly.

It's a very clever solution though. Coroutines (or threads in your sketch) give you the ability to do the inversion of control that you want without having to change the C API.

I'm curious which would be a bigger lift: getting Swift's ownership features robust enough to support this use case, or getting stackful coroutines into the language. It would be interesting to hear from some folks who work on these parts of the compiler.

I know there are problems with making a coroutine implementation work with the C ABI though, so maybe it's a moot point.

I'm not sure, but I'd be extremely reluctant to introduce multiple threads and locks, with all the associated foot guns that come with them just to work around this issue, even if it were faster.

tera · March 6, 2024, 11:41pm

BTW, wouldn't this simpler solution work?

struct Parser {
    func parse(execute: () -> (String, offset: Int, size: Int)) {
        var (string, offset, size) = execute()
        withUnsafeBytes(of: &string) { p in
            let pointer = p.baseAddress!.assumingMemoryBound(to: UInt8.self)
            pointer + offset // that's your pointer
            // do something here
        }
    }
}

Parser().parse {
    let string: String = ...
    let offset: Int = ...
    let size: Int = ...
    return (string, offset, size)
}

(I expect withUnsafeBytes to take O(1) time and space, but I didn't check that!)

davidbalbert · March 6, 2024, 11:53pm

I'm not sure I follow. I think this might be getting a bit off track though.

I don't think this example, or any example that uses Swift's closure-based pointer APIs, solves the general problem: some C APIs want you to escape a pointer by returning it from a function, and that's very hard, and often impossible, to do in Swift without managing memory yourself and doing unnecessary memcpys.

My hope is to keep this discussion focused around ways that the language could change to make this sort of thing possible, ergonomic, and pleasant to do.

tera · March 7, 2024, 12:25am

I don't think this is possible with String. NSString has this feature though. It is quite unsafe.

Ditto for NSData.bytes.