Using C code from Swift causes EXC_BAD_ACCESS

I'm trying to use a C library from Swift and, apparently, Swift is releasing memory hold by an OpaquePointer for no reason I can find.

This is a link to the project in Github that reproduces the issue. The relevant file is the main.swift and the rest is regular package creation stuff (which isn't very clear anywhere, but still).

A very quick explanation of the code is:

  1. In OnigRegularExpression::search(:String, :String) both the allocation of the regex and the use in are in the same context
  2. In OnigRegularExpression::init(:String) and OnigRegularExpression::match(:String) the allocation is in the constructor and the use is in the other method.

For some reason the second approach isn't working.
Finally, to install the library you can do: brew install oniguruma

Any pointers at why this is happening would be great.

Thanks in advance.

PS: I have more questions, like why isn't everything in the header exposed to Swift, but, I'll take one thing at a time.

The primary reason I can see is that Swift is moving the data from & freeing some of the pointer parameters to onig_new, while onig_new is assuming the pointers can be persisted after it returns, if so it is likely both usages are UB, just one is spontaneously currently working.

it is likely both usages are UB

What's UB?

As far as I understand Swift would not manage the memory allocated and leave it alone as long as a pointer is kept around.

In general Swift presumes that all pointers created by passing an inout argument as a pointer (including to withUnsafeBufferPointer etc.) can be freed after the called function returns.

If onig_new stores any of the passed pointers (&regexPointer, patternPointer, &encoding, &error) and then assumes them to remain valid after it returns then it is Undefined Behavior (UB), the program is allowed to do whatever it wants (which usually is either crash or do what you intended).

That's not very wise of Swift, not if I'm explicitly keeping the pointer in question around. In any case, is there any way to force Swift to do the right thing here?

You create a pointer here: https://github.com/erick2red/coniguruma-test/blob/266c8ad30242d1337922f44dfd7d1e8c8955f6c9/Sources/coniguruma-test/main.swift#L23. This pointer is only valid for the duration of the with block. Once that exits the pointer is dangling and may not be used.

It is therefore unacceptable to pass it into Oniguruma, which is presumably saving it and using it later.

Swift cannot know you’re saving the pointer, as pointers aren’t reference counted. Even if it could, you are violating the explicit contract of the withUnsafeBufferPointer method.

Yes. Allocate the buffer pointer with UnsafeMutableBufferPointer.allocate and copy the bytes into it. This pointer is managed entirely by you and is not owned by the Array, and so will not move around. You are responsible for freeing it when you are done.

Why would it work in the other method? I tried without using a with block, and the error is the same.

I just tried this and it doesn't work either. Same thing happen. The snippet of code is below

public final class OnigRegularExpression {
    let regexPointer: UnsafeMutablePointer<OnigRegex?>

    public init(from pattern: String) throws {
        /* ... */
        regexPointer = UnsafeMutablePointer<OnigRegex?>.allocate(capacity: 1)
        patternChars.withUnsafeBufferPointer({ patternPointer in
            var regex: OnigRegex? = nil
            let result = onig_new(&regex,
                                  patternPointer.baseAddress,
                                  patternPointer.baseAddress?.advanced(by: patternPointer.count),
                                  OnigOptionType(),
                                  &encoding,
                                  OnigDefaultSyntax,
                                  &error)
            regexPointer.initialize(from: &regex, count: 1)
        })
    }

    public func match(string source: String) throws {
        /* ... */

        try sourceChars.withUnsafeBufferPointer({ charsPointer in
            let result = onig_search(regexPointer.pointee,
                                     charsPointer.baseAddress,
                                     charsPointer.baseAddress?.advanced(by: charsPointer.count),
                                     charsPointer.baseAddress,
                                     charsPointer.baseAddress?.advanced(by: charsPointer.count),
                                     region,
                                     ONIG_OPTION_NONE)

            /* ... */
        })
    }

Do you mean to pass regexPointer into onig_new instead? They point to different locations.

regexPointer.initialize(from: &regex, count: 1)
onig_new(regexPointer, ...)

This is tricky to debug, the onig APIs consume a lot of pointers, and I don't know which one outlive the function call. If they outlive the with call that create them, then it's UB as others have said, and it could be as early as initialize

var encoding = OnigEncodingUTF8
withUnsafeMutablePointer(to: &encoding, { useEncodings in
    var encs: UnsafeMutablePointer<OnigEncodingTypeST>? = useEncodings
    _ = onig_initialize(&encs, 1) // <=== Does this store encs?
})

No, that doesn't work either. The documentation on this is terrible at best. From what I understand initialize(::) will copy the memory from from to the memory dynamically allocated with allocate, so it only makes sense to call it after, onig_new has allocated the memory and stored the pointer (given that what we're doing is only copying the pointer)

The C definition is:

int onig_new (OnigRegex*, const OnigUChar* pattern, const OnigUChar* pattern_end, OnigOptionType option, OnigEncoding enc, OnigSyntaxType* syntax, OnigErrorInfo* einfo);

The only thing you need to keep around is OnigRegex*, the rest is local. Same with onig_initialize everything there is local.

It's also not the rest of the parameters, the SIGSEGV happens in the part of the C code accessing the pointer to the regex.

Be careful, though. UB also includes "working properly, for a while". It is possible that some pointers are already stale earlier, but the program is not crashing since no one overwrite the data up until that point.

So long as you know the rule "pointers don't escape with". You can even think of with functions as

  • allocate memory
  • copy data to that memory
  • call the closure using the allocated memory
  • deallocate memory after the closure exits.

Also, Unsafe*Pointer.allocate behaves like calloc, it just allocates a new memory, which you can point to as much as you want. The allocated memory will not be handled by Swift's memory management. So don't forget to call deallocate on the pointer after you're done.

That's about as much as I can help with.

This doesn't make sense either. What does it means? The memory allocated by onig_new is deallocated by Swift? If that's the case why does it work in one instance and not in the other, both methods use the call to with.... Swift is releasing the memory in one of the two and not the other?

Yes. Thanks. I would love to find someone who knew what is Swift doing here.

I think you misunderstood what @lukasa is telling you. withUnsafeBufferPointer gives you a buffer pointer that is guaranteed to exist only for the duration of the closure. You must not pass patternPointer.baseAddress to onig_new, because the resultant object will try to access the underlying buffer later (for example, when you call match), but that buffer won't exist. You must allocate your own UnsafeMutableBufferPointer.

Well, I do know what Swift is doing. I, however do not know what onig is doing.


From digging into the source code for a bit (just a little bit, so take it with a grain of salt), it seems that (one of) the culprits is encoding. You need encoding to stay alive while using the regex, but encoding is destroyed when init finishes execution. The first case is fine because encoding is destroyed at the end of the search function, which is long enough to parse the string.

I believe the intended usage is to directly reference the OnigEncodingUTF8.

On initialize:

withUnsafeMutablePointer(to: &OnigEncodingUTF8) { utf8Encoding in
  var encodings = [ utf8Encoding as Optional ]
  _ = onig_initialize(&encodings, 1)
}

On creating:

onig_new(..., &OnigEncodingUTF8, ...)

What I (and others) meant by the pointer not escaping is that:

withUnsafe... { pointer in //<=== This one
  // `pointer` is valid here
}
// `pointer` is *NOT* valid here

If onig_new does store patternPointer, the patternPointer could outlive the with function that creates them, which is bad.

1 Like

As you can see from previous posts. This wouldn't be the case. First, I've already tried without the call the .with... method. Second, if you look at the C definition of the function it receives a const char* which usually means that it either makes a copy or don't need it for further use. Third, that doesn't explain why the second method match(::) works which uses the same call to .with...

Yes. That did it. Now, teach me how you did that, so I don't have to bother the next time. I went into onig_search source code and down to this point, which made me think it was failing there. For some reason the type of the struct isn't exported so I couldn't do something.pointee.num_mem to test if it was that line.

Apparently, it was this line.

Again, thanks.
And again, just tell us how you did it, so I can do it the next time.

It was largely luck. Since I suspected memory management (you're working with C, of course... It's memory management...), I try to put variables in the innermost scope possible. Swift would deallocate data at the end of the scope or earlier so you'd have a lot less live data to mess with your assumption, and I usually prefer it that way anyway. And sure enough, when I was editing the orig_new in scenario 1.

From this:

var regexPointer: OnigRegex? = nil
var error = OnigErrorInfo()
var encoding = OnigEncodingUTF8
patternChars.withUnsafeBufferPointer({ patternPointer in
  let result = onig_new(...)
  ...
})

To this:

let regexPointer: OnigRegex = patternChars.withUnsafeBufferPointer { patternPointer in
  var regex: OnigRegex?, error = OnigErrorInfo()
  let result = onig_new(...)
  ...
}

then scenario 1 crashes. After moving things in & out of that scope, it's easy to see that encoding has something to do with it.

Then I went to doc & source code, and notice that they work with OnigEncoding, not OnigEncodingTypeST* (the former is an alias for the latter btw), so I speculate that OnigEncoding is an opaque reference (i.e. a class-like object). That'd be bad since we keep creating copies with let encoding = .... So I tried replacing all encoding with OnigEncodingUTF8, and sure enough, it works. The rest is to dig source code to figure out an explanation.


PS

most with functions return whatever you return from the closure, so you have a lot of freedom during initialization, but may need to annotate the return types sometimes. For example, during init, you can just do this:

// Now we don't need optional
let regexPointer: OnigRegex = patternChars.withUnsafeBufferPointer { patternPointer in
  var regex: OnigRegex!, error = OnigErrorInfo()
  let result = onig_new(&regex,
    patternPointer.baseAddress,
    patternPointer.baseAddress?.advanced(by: patternPointer.count),
    OnigOptionType(),
    &OnigEncodingUTF8,
    OnigDefaultSyntax,
    &error)

    if result != ONIG_NORMAL {
      print("Initialization failed with error: \(result)")
    }

    return regex
}

Also, you don't throw an error when result != ONIG_NORMAL there, which I find weird.


Not that the & bridging is only "shorthand" for appropriate with function:

// This
foo(&a)

// Is the same as
withUnsafeMutablePointer(&a) { x in
  foo(x)
}

Now we do the same for encoding:

onig_new(&regex, ..., &OnigEncodingUTF8, ...)

turns into

withUnsafeMutablePointer(&OnigEncodingUTF8) { encoding in
  onig_new(&regex, ..., encoding, ...)
}

which is bad. Remember, regex is storing the encoding, and so encoding is outliving the & shorthand that creates it. Honestly, I don't know if it's ok, but it is for a similar case. You gotta be careful though.

2 Likes

At this point these questions are academic.

I went through the code today, the C code and turns out OnigEncodingUTF8 it's just the pointer to a global struct created by the library, and now I'm lost at why it fails when encoding goes out of context since it's just copying the memory address.

    (reg)->enc            = enc;  // C code

I want to understand what's Swift doing? Clearing that enc field in the struct when encoding goes out of context? Or is it making encoding a pointer to a pointer, if that's the issue, when does C takes over and copy the initial pointer?

Skeleton sample code. Just to reproduce the problem.

OnigEncodingUTF8 is an OnigEncodingType, which is a struct. So we make a copy when we do:

var encoding = OnigEncodingUTF8

Now a pointer to encoding won't be the same as a pointer to OnigEncodingUTF8, and the former becomes invalid when encoding is deallocated.

enc is OnigEncoding, which is a pointer to OnigEncodingType. So they're copying a pointer to encoding, which becomes invalid when encoding is deallocated.

Swift doesn't clear the pointer, it just deallocate the memory that the pointer is pointing to (encoding).

1 Like

Makes sense now. Thanks.