Stable pointer into a C string without copying it?

Hello,

I recently wanted to access stable pointers into a C string built from a Swift string, and I wanted to ask if there were a better way to do it.

My question involves parsing through a low-level C API, so it may be of general interest.


The use case is the following: I want to compile multiple SQL statements joined with semicolons with an SQLite API (sqlite3_prepare_v3) that accepts a const char *zSql as the beginning of a C string, and a const char **pzTail as an output pointer to the "unused portion of zSql" (read: everything after the compiled statement):

"INSERT INTO ...; DELETE ...;"
 ^ zSql          ^ pzTail

In order to compile multiple statements, one uses the pzTail value as the zSql on the next call to sqlite3_prepare_v3 (until the string is exhausted).

It is thus important to have a C string which is stable in memory across the multiple calls to sqlite3_prepare_v3, so that pointers point to the expected characters.
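To make the zSql / pzTail handoff concrete, here is a minimal sketch of the loop when everything happens inside one scope. It does not call SQLite: consumeStatement is a hypothetical stand-in for sqlite3_prepare_v3 that simply splits on ";" and returns the tail pointer, and statements(in:) is likewise an illustrative name.

```swift
// Hypothetical stand-in for sqlite3_prepare_v3: "compiles" one statement starting
// at zSql and returns a pointer to the unused remainder (the role of pzTail).
func consumeStatement(_ zSql: UnsafePointer<CChar>) -> (text: String, tail: UnsafePointer<CChar>) {
    var p = zSql
    while p.pointee == 32 { p += 1 }                      // skip leading spaces
    let start = p
    while p.pointee != 0 && p.pointee != 59 { p += 1 }    // stop at NUL or ";"
    let bytes = UnsafeBufferPointer(start: start, count: start.distance(to: p))
        .map { UInt8(truncatingIfNeeded: $0) }
    if p.pointee != 0 { p += 1 }                          // step over the ";"
    return (String(decoding: bytes, as: UTF8.self), p)
}

// Iterates every statement while the C string is guaranteed to stay in place.
func statements(in sql: String) -> [String] {
    var result: [String] = []
    sql.utf8CString.withUnsafeBufferPointer { buffer in
        var cursor = buffer.baseAddress!                  // zSql for the first call
        while cursor.pointee != 0 {
            let (text, tail) = consumeStatement(cursor)
            if !text.isEmpty { result.append(text) }
            cursor = tail                                 // the tail is the next zSql
        }
    }
    return result
}
```

The pointers are only used inside the closure here, which is exactly the constraint that becomes a problem once the iteration has to be suspended and resumed later.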


Swift strings can generate a C string with utf8CString, which returns a ContiguousArray.

A contiguous array is "a specialized array that always stores its elements in a contiguous region of memory." This looks like a good match: I can expect the C string to be located at a stable location in memory :-)


However, both the withUnsafeBufferPointer(_:) and withUnsafeBytes(_:) methods of ContiguousArray document that:

The pointer argument is valid only for the duration of the method’s execution.

I also wanted to design an API where the user decides whether the iteration of SQL statements should stop or continue:

// User code
let statements = database.allStatements(sql: "...")
while let statement = try statements.next() {
    // use statement, break, etc
}

This means that I cannot iterate all statements in one go. I really need a stable C string for an unknown amount of time (until all statements are iterated, or the user decides the iteration should stop).


I ended up performing a copy of the C string:

// User input:
let sql: String

// Initialization of the iteration:
// Build a C string and copy it into a stable buffer
let cString = sql.utf8CString
let buffer: UnsafeBufferPointer<CChar> = cString.withUnsafeBytes { rawBuffer in
    let copy = UnsafeMutableRawBufferPointer.allocate(
        byteCount: rawBuffer.count,
        alignment: MemoryLayout<CChar>.alignment)
    copy.copyMemory(from: rawBuffer)
    return UnsafeBufferPointer(copy.bindMemory(to: CChar.self))
}

// Later when the iteration has ended:
buffer.deallocate()

This copy guarantees that the C string is at a stable location in memory, so I can invoke sqlite3_prepare_v3 several times, each time starting from the tail returned by the previous invocation, moving forward a pointer initialized at buffer.baseAddress.
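One way to avoid forgetting the deallocate() call is to tie the copied buffer to the lifetime of the iterating object. The following is only a sketch under assumptions: CopiedBufferCursor is a hypothetical name, and consumeStatement is a hypothetical stand-in for sqlite3_prepare_v3 that splits on ";" instead of compiling anything.

```swift
// Hypothetical stand-in for sqlite3_prepare_v3 (splits on ";", returns the tail).
func consumeStatement(_ zSql: UnsafePointer<CChar>) -> (text: String, tail: UnsafePointer<CChar>) {
    var p = zSql
    while p.pointee == 32 { p += 1 }                      // skip leading spaces
    let start = p
    while p.pointee != 0 && p.pointee != 59 { p += 1 }    // stop at NUL or ";"
    let bytes = UnsafeBufferPointer(start: start, count: start.distance(to: p))
        .map { UInt8(truncatingIfNeeded: $0) }
    if p.pointee != 0 { p += 1 }                          // step over the ";"
    return (String(decoding: bytes, as: UTF8.self), p)
}

final class CopiedBufferCursor {
    // Owned copy of the C string; its address is stable until deinit.
    private let buffer: UnsafeMutableBufferPointer<CChar>
    // Points into `buffer`; plays the role of zSql / pzTail.
    private var cursor: UnsafePointer<CChar>

    init(sql: String) {
        let cString = sql.utf8CString                     // includes the trailing NUL
        let storage = UnsafeMutableBufferPointer<CChar>.allocate(capacity: cString.count)
        _ = storage.initialize(from: cString)
        buffer = storage
        cursor = UnsafePointer(storage.baseAddress!)
    }

    deinit { buffer.deallocate() }

    // Returns the next non-empty statement, or nil when the string is exhausted.
    func next() -> String? {
        while cursor.pointee != 0 {
            let (text, tail) = consumeStatement(cursor)
            cursor = tail
            if !text.isEmpty { return text }
        }
        return nil
    }
}
```

The pointer never leaves the object, so it cannot outlive the buffer it points into; the cost is still the extra copy discussed above.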


Does anyone know if it is possible to avoid this copy?

EDIT: I could use offsets instead of stable pointers (by subtracting baseAddress from pzTail and using this offset in order to build a new pointer from the contiguous array on the next step of the iteration). This would be a valid solution. But how come C developers don't face the same problems and workarounds :sweat_smile: ?

One thing that stands out to me is that your API will need to take a copy of a String in most situations, because that's the only way you can ensure it won't change when you don't expect it to. If we use a (contrived) example:

var sql = "some sql"

class Executor { var sqlPointer: some Pointer }

Executor(sql: sql)

What happens if, right after, a user tries to modify sql in place, without making a copy (maybe so they can reuse a modified version in another call to your API)?

sql.replaceSubrange(...sql.startIndex, with: "?")

If you could keep a pointer to your string's memory around, the user could unintentionally modify the memory buffer you've got stored, which would be bad (for you).

One possible way to approach this is to move the unsafety of this operation to the forefront of your API, by using UnsafePointer<Int8> (or StaticString, depending on your API's needs).

func execute(sql: UnsafePointer<CChar>) {
    // eventually deallocate
}

execute(sql: "some sql" + "some more sql")

What I can't vouch for is whether the compiler optimizes that automatic conversion to UnsafePointer. The ideal, imo, would be that for completely static strings (even concatenated ones) UnsafePointer is cheap and behaves like StaticString does (creates an immutable buffer of memory for the string that is stored on the stack), and that for strings created at runtime a copy is made.

UnsafePointer also comes with its own complications (like not giving you the ability to read the length of the string upfront, without having the user provide a second parameter.)

EDIT:

Welp, that didn't last long. UnsafePointer's automatic conversion works the same way as the methods you already mentioned, assuming it applies to static strings too:

The pointer created through implicit bridging of an instance or of an array’s elements is only valid during the execution of the called function. Escaping the pointer to use after the execution of the function is undefined behavior.

In which case, you might have to write and expose a C function in order to get this functionality but... that's kind of yucky.

EDIT EDIT:

There's actually no guarantee, now that I think about it, that even for C functions the pointer is allowed to outlive the call... hmph.

Reading your comment, I realize I did not make it clear that the designed API accepts a Swift String, not a CChar buffer. So all immutability guarantees of Swift value semantics apply. COW types such as String, Array or ContiguousArray just do not mutate by surprise.

The C string I need is private to the implementation. I understand that I need to turn the inner representation of a Swift string into a UTF-8 C string understood by SQLite. It may involve copying a lot of bytes, and I'm OK with this. What I would like to avoid is copying this C string a second time.

One possible way to approach this is to move the unsafety of this operation to the forefront of your API

I would still need a nicer API for end users. They usually provide SQL strings: that's what they type in their code, and what they store in SQL files.

EDIT: I clarified the initial post to say that the user provides a Swift String. Thanks :-)

As the documentation says, you only get scoped access to the interior pointer. That means it is not safe to return from the next() function while keeping the pointer around. If you can use offsets to resume parsing on each call to next(), that seems like a good solution. So you haven't missed anything, and I don't think there's any better way than the one you've already thought of.

An alternative design could be to have the user pass the body of the loop as a closure to your function:

func forEachStatement(
  in sql: String, _ body: (Statement) throws -> Void
) rethrows -> Void {
  try sql.utf8CString.withUnsafeBufferPointer { buffer in
    while let nextStatement = parseStatement(buffer) {
      try body(nextStatement)
    }
  }
}

This way, you can use the interior pointer and provide a callback for each parsed statement without exiting the withUnsafeBufferPointer closure. This isn't entirely satisfying, of course - you lose control flow for loops (break, continue, etc) and it doesn't gel very well with things like sequences or iterators. For instance, to create an Array from those statements, you'd have to go through a dance of creating an empty Array and appending each element individually.

As to why C developers don't face the same problems? Because the lifetime of dynamically allocated memory in C is not guaranteed. In C, you are responsible for ensuring the memory the pointer points to is valid, and if you get it wrong, the behaviour is undefined.

There are a couple of things we could do to make this more pleasing, though:

  1. Rather than calling the body closure, we could suspend what we're doing (while still inside the closure), return the value to whomever called us, and then resume from that point whenever somebody asks for the next statement. When iteration ends, we'd run to the end and finally exit the withUnsafeBufferPointer closure.

    In other words: we could make this a generator. And rather than calling a closure, we'd want to be yield-ing values. Conceptually, it is similar to the way _read and _modify coroutines work (although you can yield multiple times), and IIRC there is interest in adding generators to the language at some point, but no concrete plans right now.

  2. We could just give you a way to safely escape the pointer - i.e. to guarantee the lifetime of the memory it points to. In principle, a (pointer, owner) pair or deconstructed COW could work to give you a pointer that you can store safely.

In that case, wouldn't something like:

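final class Executor {

    // Owned copy of the NUL-terminated UTF-8 bytes of `sql`
    let sqlPtr: UnsafeMutablePointer<CChar>

    init(sql: String) {
        let cString = sql.utf8CString
        let copy = UnsafeMutablePointer<CChar>.allocate(capacity: cString.count)
        cString.withUnsafeBufferPointer {
            // copy these bytes to sqlPtr
            copy.initialize(from: $0.baseAddress!, count: $0.count)
        }
        sqlPtr = copy
    }

    deinit { sqlPtr.deallocate() }

}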

suffice to reduce the number of copies down to one?

Thank you very much, that's the kind of soothing answer I was looking for. Let's switch to offsets and remove this copy :-)

Yes, I agree. This solution was rejected due to the ergonomic issues you mention. The throwing next() method comes from a Cursor protocol which already ships with some goodness on top of the raw while loop (ready-made map, flatMap, filter, RangeReplaceableCollection & Set initializer, etc.)

As to why C developers don't face the same problems? Because the lifetime of dynamically allocated memory in C is not guaranteed. In C, you are responsible for ensuring the memory the pointer points to is valid, and if you get it wrong, the behaviour is undefined.

I naively expected that ContiguousArray would handle this correctly across the whole lifetime of its storage. But any exported pointer would become invalid as soon as the array is modified! In order to ensure memory safety (and profit from the compiler/runtime checks of the law of exclusivity), ContiguousArray had no other option than exposing a transient buffer. I get it now.

So in the specific case of this thread, my lib is responsible for allocating a buffer in order to ensure the validity of the pointers needed by SQLite. This is now pretty clear (at least to me), but I needed help.

If I do not want to allocate, I need to use offsets. That's it.
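For the record, the offset-based approach could look roughly like the sketch below. It is not the actual library code: OffsetCursor is a hypothetical name, and consumeStatement is a hypothetical stand-in for sqlite3_prepare_v3 that splits on ";" and returns the tail pointer. The cursor stores the ContiguousArray and an Int offset, and only ever forms pointers inside withUnsafeBufferPointer.

```swift
// Hypothetical stand-in for sqlite3_prepare_v3 (splits on ";", returns the tail).
func consumeStatement(_ zSql: UnsafePointer<CChar>) -> (text: String, tail: UnsafePointer<CChar>) {
    var p = zSql
    while p.pointee == 32 { p += 1 }                      // skip leading spaces
    let start = p
    while p.pointee != 0 && p.pointee != 59 { p += 1 }    // stop at NUL or ";"
    let bytes = UnsafeBufferPointer(start: start, count: start.distance(to: p))
        .map { UInt8(truncatingIfNeeded: $0) }
    if p.pointee != 0 { p += 1 }                          // step over the ";"
    return (String(decoding: bytes, as: UTF8.self), p)
}

struct OffsetCursor {
    private let cString: ContiguousArray<CChar>           // owns the bytes, no second copy
    private var offset = 0                                // position of the next zSql

    init(sql: String) { cString = sql.utf8CString }

    // Returns the next non-empty statement, or nil when the string is exhausted.
    mutating func next() -> String? {
        let start = offset
        let (statement, newOffset) = cString.withUnsafeBufferPointer { buffer -> (String?, Int) in
            let base = buffer.baseAddress!
            var cursor = base + start                     // rebuild zSql from the offset
            while cursor.pointee != 0 {
                let (text, tail) = consumeStatement(cursor)
                cursor = tail
                // Convert the tail back into an offset before the pointer goes out of scope.
                if !text.isEmpty { return (text, base.distance(to: cursor)) }
            }
            return (nil, base.distance(to: cursor))
        }
        offset = newOffset
        return statement
    }
}
```

No pointer survives a call to next(): only the offset does, and it stays meaningful even if the array's storage were to move between calls.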

There are a couple of things we could do to make this more pleasing, though:

I finally get why generators are not the same beast as iterators :-)

As for the future of Swift, I'm happy our best crew is so well inspired. Thank you again :+1:

Just mentioning that you can copy the string with strdup():

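import Foundation  // makes strdup() and free() available

let sql: String = ...
let cString = strdup(sql)  // heap-allocated copy; the caller owns it
// Do something with `cString` ...
free(cString)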