Deep recursion in deinit should not happen

No problem, no apologies necessary – I was myself quite "strict" in my response because I fear that an opportunity could get lost to make Swift even greater than it already is :slight_smile:

I have an XML library where I need direct access to e.g. elements of a certain name, without having to traverse the whole tree. The list is in the order of creation, which is perfect, and consists of the elements themselves pointing to the next one with the same name, and pointing back to the previous one weakly (for removal). But of course, any element might get removed or renamed, so I need to be able to remove an element from that list in an efficient way. And there can be many elements of the same name in a big document. And after processing the XML document it is not needed anymore, including all elements and those lists.

Great to hear this, thanks.

1 Like

One of my test cases for Graphing Calculator is a file with ---...---x with 16k minus signs, which parses to a tree 16k deep and simplifies to "x". The C++ version manages memory manually, iterating over nodes without recursion, and a YACC state machine parser for modest stack requirements. (The Swift version's hand written recursive descent parser overflows stack before it has a chance to stress deinit.) While I wouldn't classify this as a real-world example, it has been part of my test suite for many years.

5 Likes

Here's a sketch of possible non recursive deinit algorithm. The algorithm maintains a linked list of "to be deallocated" objects, the list is maintained via the "next" variable inside the object itself.

Initially deallocation list is empty. When an object with retainCount = 1 is released - it is pushed on top of deallocation list. Then the list is processed from top. We pop the top object from the list, we take all reference type properties of this object and look at them one by one right to left. If the property's retain count is greater than 1 we simply decrement it. If the property's retain count is 1 this object is pushed to deallocation list top. After we placed all object's properties on the deallocation list we start processing this list again, starting from top. We repeat this until deallocation list is empty.

If I am not mistaken deallocation order of this algorithm will match the current recursive deallocation order. If needed objects whose RC > 1 can also be released in the current left to right order, but that's probably not important.

As these are the objects being deallocated that are placed on the list, we can probably reuse 8 bytes of the object proper as the "next" variable, if that's possible the algorithm will not require additional per object memory. The issue left open - thread safety of the release list top variable.

Edit: if retain count is stored inside the object itself (is it now? I remember some global tables were used in the past) and if RC is pointer sized - RC itself can be reused as a pointer, as AFAIK there is no guarantee about RC of deallocated objects.

Edit 2: there's probably no issue with deallocation list top variable thread safety, as it can be maintained per "top level release call", even as a local stack variable. This algorithm internals won't call top level release. Separate top level invocations of release / deinit will effectively use different deallocation lists.

Forgive me for the tangent: I tested the ARC and MRC (manual reference counting) version of this simple app that creates and drops a linked list a few times (keeping the list length within limits of the current ARC implementation). It did this in obj-c for simplicity (I suppose I'd get similar results for swift implementation that uses unowned references and some external table to keep strong references to the nodes).

ARC app
// ARC app
#import <Foundation/Foundation.h>

@interface Chained: NSObject {
    @public Chained* next;
}
@end

@implementation Chained
@end

void test(void) {
    NSLog(@"start");
    double allocTime = 0;
    double deallocTime = 0;

    for (int i = 0; i < 10000; i++) {
        double start = CFAbsoluteTimeGetCurrent();
        Chained* first = nil;
        for (int i = 1; i < 30000; i++) {
            Chained* next = [Chained new];
            next->next = first;
            first = next;
        }
        allocTime += CFAbsoluteTimeGetCurrent() - start;
        
        start = CFAbsoluteTimeGetCurrent();
        first = nil;
        deallocTime += CFAbsoluteTimeGetCurrent() - start;
    }
    NSLog(@"allocTime: %f, deallocTime: %f, totalTime: %f", allocTime, deallocTime, allocTime + deallocTime);
    // allocTime: 21.279008, deallocTime: 32.589441, totalTime: 53.868450
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        test();
    }
    return 0;
}
MRC app
// MRC app
//specify -fno-objc-arc for this file in build phases

#import <Foundation/Foundation.h>

@interface Chained: NSObject {
    @public Chained* next;
}
@end

@implementation Chained
@end

void test(void) {
    NSLog(@"start");
    double allocTime = 0;
    double deallocTime = 0;

    for (int i = 0; i < 10000; i++) {
        double start = CFAbsoluteTimeGetCurrent();
        Chained* first = nil;
        for (int i = 1; i < 30000; i++) {
            Chained* next = [Chained new];
            next->next = first;
            first = next;
        }
        allocTime += CFAbsoluteTimeGetCurrent() - start;
        
        start = CFAbsoluteTimeGetCurrent();
        while (first) {
            Chained* cur = first;
            first = cur->next;
            [cur release];
        }
        deallocTime += CFAbsoluteTimeGetCurrent() - start;
    }
    NSLog(@"allocTime: %f, deallocTime: %f, totalTime: %f", allocTime, deallocTime, allocTime + deallocTime);
    // allocTime: 15.840458, deallocTime: 18.275354, totalTime: 34.115811
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        test();
    }
    return 0;
}

Release mode, Intel mac:
ARC: allocTime: 21.27, deallocTime: 32.58, totalTime: 53.86
MRC: allocTime: 15.84, deallocTime: 18.27, totalTime: 34.11

oops. No free lunch with ARC.

PS. There are lies, damned lies, and benchmarks.

That any automated garbage collection (ARC or the Java way) comes with some performance penalty should not be surprising. I would have expected the mentioned "recursion" of deinits to be much more expensive, but did not notice much of a runtime difference in my own examples (with or without the code mentioned by @lorentey above to avoid a possible stack overflow). I think we do not have a runtime performance problem here, ARC is doing quite well in this respect.

3 Likes

The presence of performance degradation itself is unsurprising of course. How big it is - somewhat is; I'd assume the task at hand would be totally dominated by allocation/deallocation costs, adding, perhaps, 5% for reference count management overhead [just imagine a different test that doesn't involve allocation/deallocation, the ARC overhead would be very significant], in case objects are born, accessed and killed on a single thread. To be fair I need to replicate this test in swift and on ARM: both could have relevant optimisation improvements.

How do I mark objects immortal in Swift / Obj-C? Is it possible to do in in macOS/iOS app using standard unmodified Xcode toolchain?

Recursion itself is cheap: a function call, no dynamic memory allocation, etc. (I once saw a system where recursive calls caused memory allocation but that was an exception, normally it's not done this way.)

Any feedback on the sketch of algorithm outlined in post #23, can it fly?

Added manual memory management "swift unsafe" and C version:

Swift ARC
class StrongNode {
    var next: StrongNode? = nil
}

print("start")

var allocTime = 0.0
var deallocTime = 0.0

for _ in 0 ..< 10_000 {
    let start = CFAbsoluteTimeGetCurrent()
    var first: StrongNode? = nil
    for _ in 1...30_000 {
        let next = StrongNode()
        next.next = first
        first = next
    }
    allocTime += CFAbsoluteTimeGetCurrent() - start

    func deallocList() {
        first = nil
    }
    let start2 = CFAbsoluteTimeGetCurrent()
    deallocList()
    deallocTime += CFAbsoluteTimeGetCurrent() - start2
}

print("allocTime: \(allocTime), deallocTime: \(deallocTime), totalTime: \(allocTime + deallocTime)")
// allocTime: 20.429471373558044, deallocTime: 26.36126732826233, totalTime: 46.79073870182037
print("done")
Swift Unsafe
typealias NodePtr = UnsafeMutablePointer<Node>

struct Node {
    var next: NodePtr

    static func newNode() -> NodePtr {
        let node = calloc(MemoryLayout<Node>.size, 1)!.assumingMemoryBound(to: Node.self)
        return node
    }
    static func deleteNode(_ node: NodePtr) {
        free(node)
    }
}

print("start")

var allocTime = 0.0
var deallocTime = 0.0

for _ in 0 ..< 10_000 {
    let start = CFAbsoluteTimeGetCurrent()
    let firstNode = Node.newNode()
    var first = firstNode
    for i in 1...30_000 {
        let next = Node.newNode()
        next.pointee.next = first
        first = next
    }
    allocTime += CFAbsoluteTimeGetCurrent() - start

    let start2 = CFAbsoluteTimeGetCurrent()
    while first != firstNode {
        let cur = first
        first = cur.pointee.next
        Node.deleteNode(cur)
    }
    deallocTime += CFAbsoluteTimeGetCurrent() - start2
}

print("allocTime: \(allocTime), deallocTime: \(deallocTime), totalTime: \(allocTime + deallocTime)")
// allocTime: 14.85484755039215, deallocTime: 14.128059029579163, totalTime: 28.982906579971313

print("done")
C version
#include <stdio.h>
#include <CoreFoundation/CoreFoundation.h>

typedef struct NodeRec* Node;

struct NodeRec {
    Node next;
};

Node newNode(void) {
    Node node = malloc(sizeof(Node));
    node->next = NULL;
    return node;
}

void deleteNode(Node node) {
    free(node);
}

void test(void) {
    printf("start\n");

    double allocTime = 0.0;
    double deallocTime = 0.0;

    for (int i = 0; i < 10000; i++) {
        double start = CFAbsoluteTimeGetCurrent();
        Node first = NULL;
        for (int i = 0; i < 30000; i++) {
            Node next = newNode();
            next->next = first;
            first = next;
        }
        allocTime += CFAbsoluteTimeGetCurrent() - start;

        double start2 = CFAbsoluteTimeGetCurrent();
        while (first) {
            Node cur = first;
            first = cur->next;
            deleteNode(cur);
        }
        deallocTime += CFAbsoluteTimeGetCurrent() - start2;
    }

    printf("allocTime: %f, deallocTime: %f, totalTime: %f\n", allocTime, deallocTime, allocTime + deallocTime);
    // allocTime: 13.667640, deallocTime: 13.962520, totalTime: 27.630160

    printf("done\n");
}

int main(int argc, const char * argv[]) {
    test();
    return 0;
}

Obj-C ARC: allocTime: 21.27, deallocTime: 32.58, totalTime: 53.86
Obj-C MRC: allocTime: 15.84, deallocTime: 18.27, totalTime: 34.11
Swift ARC: allocTime: 20.43, deallocTime: 26.36, totalTime: 46.79
Swift Unsafe (*): allocTime: 14.85, deallocTime: 14.12, totalTime: 28.98
C version: allocTime: 13.66, deallocTime: 13.96, totalTime: 27.63

(*) - this uses raw memory data structures, unsafe pointers + malloc/free, quite different version compared to others.

(release mode, intel mac)

Edit: fixed bugs above and added "Swift unsafe" manual memory management version for completeness.

Edit 2: added C version.

1 Like

Out of curiosity, what CPU is this running on? I'd expect to see a very sharp difference between e.g. an M1 and an x86-64 CPU, due to differing atomic compare and swap performance.

x86-64, I don't have M1 mac to test, would be interesting to see tests results on M1.

I can try to run it later, but my M1 machine is my personal laptop and I don't install Xcode on it for work/life balance reasons, so it'll require a little maneuvering :sweat_smile:

2 Likes

Preliminary results for the Swift ARC example on an M1 Max (Xcode 13.2.1):

allocTime: 8.592, deallocTime: 18.922, totalTime: 27.515

I can try more later.

@tera Can you provide a project that can run these tests rather than need for us to build our own? Or a repo with the test cases?

2 Likes

To further the sample size, here are my results (also) on an M1 Max:

Swift ARC:

allocTime: 6.73, deallocTime: 16.76, totalTime: 23.49

Swift Unsafe:

allocTime: 3.66, deallocTime: 3.70, totalTime: 7.37

C Version:

allocTime: 3.22, deallocTime: 4.00, totalTime: 7.23
3 Likes

If anything, that's even more mysterious. Some interesting detective work to be done here!

Interesting results, thank you. I assume release build was used in all tests. Looks like ARC overhead dominates in this microbenchmark on M1 adding +200% while on Intel it adds a more moderate +60%. If anyone wants to explore this further we need to spawn a new dedicated thread.

[Edit: btw it would be much more honest to test obj-c ARC vs MRC versions to measure the impact of ARC: swift unsafe is quite different beast in a few ways (not just ARC) compared to Swift ARC version.]


Let's navigate back to "deep recursion in deinit should not happen".

One could imagine a version of autorelease which sets a “target low water mark” when it enters and then keeps releasing off the top of its stack until it gets there, which would almost-elegantly solve the problem… in environments that happen to have an autorelease pool lying around.

1 Like

Small clarification. The pure reference counting overhead is lower on M1 vs x86.
allocTime is 6.9s vs 13.3s
deallocTime is high on M1 (18s vs 12s) because it's doing a lot of pointer authentication work. Not exactly a fair comparison. This results from a deeply recursive call chain. So the overhead is only indirectly related to ARC.

1 Like

I agree. Obj-C (ARC & MRC) versions above are much closer to each other than Swift v Swift Unsafe.

This is a bit off-topic, but this line (within your Swift Unsafe code) results in undefined behavior. assumingMemoryBound assumes that the pointer is bound to the specified type, yet the result of calloc(MemoryLayout<Node>.size, 1)! hasn't been bound to any type.[1]

Instead, use bindMemory to bind memory to a type. Or use UnsafeMutablePointer.allocate to avoid working with unbound memory altogether.

Also, 0x0 is not a valid value for an instance of UnsafeMutablePointer. You should either initialize Node to a valid value before using it or make next a value of type UnsafeMutablePointer? (a trivial type where 0x0 is valid) instead.

Here's how I would write it
import CoreFoundation

typealias NodePointer = UnsafeMutablePointer<Node>

struct Node {
    var next: NodePointer?

    static func newNode() -> NodePointer {
        return calloc(MemoryLayout<Node>.size, 1)!.bindMemory(to: Node.self, capacity: 1)
    }
    
    static func deleteNode(_ node: NodePointer) {
        free(node)
    }
}

print("start")

var allocTime = 0.0
var deallocTime = 0.0

for _ in 0..<10_000 {
    let start = CFAbsoluteTimeGetCurrent()
    let firstNode = Node.newNode()
    var first = firstNode
    for _ in 0..<30_000 {
        let next = Node.newNode()
        next.pointee.next = first
        first = next
    }
    allocTime += CFAbsoluteTimeGetCurrent() - start

    let start2 = CFAbsoluteTimeGetCurrent()
    while first != firstNode {
        let cur = first
        first = cur.pointee.next.unsafelyUnwrapped
        Node.deleteNode(cur)
    }
    deallocTime += CFAbsoluteTimeGetCurrent() - start2
}

print("allocTime: \(allocTime), deallocTime: \(deallocTime), totalTime: \(allocTime + deallocTime)")

print("done")

Here's a good video that explains pointer safety in detail.

This is probably the greatest benefit of using ARC: no tricky pointers!


  1. According to cppreference.com:

    If the object was created by an allocation function (including realloc), it has no declared type.

    ↩︎
1 Like

The following implements a simplified version of non recursive deallocation algorithm that might be good for Swift / Objective-C and other ref count based languages.

  • The sketch is done in C++ subset (C with inheritance and inference)

  • The sketch reuses retainCount variable storage for the link variable for objects being deallocated.

  • The particular guinea pig user type doesn't matter for this sketch. Currently it uses a binary tree structure as an example. If you want to experiment with a different user data structure adjust these two overrides accordingly:

    virtual int objectCount() override {
        return 2;
    }
    virtual id objectAtIndex(int i) override {
        return i == 0 ? left : right;
    }
    

Sketch output when recursion is used (mimics the current recursive algorithm found in Swift & Objective-C). Note the indentation:

test1 (small tree)
    1 deinit Node root
        2 deinit Node left
            3 deinit Node left-a
            4 deinit Node left-b
        5 deinit Node right
            6 deinit Node right-a

test2 (long list)
b4 release
1000 deinit Node node
2000 deinit Node node
3000 deinit Node node
...
21000 deinit Node node
22000 deinit Node node
23000 deinit Node node

ThreadSanitizer:DEADLYSIGNAL
==67795==ERROR: ThreadSanitizer: SEGV on unknown address 0x60000108a000 (pc 0x0001001912b7 bp 0x7ffeef6f7250 sp 0x7ffeef6f7230 T2047778)
==67795==The signal is caused by a WRITE memory access.

Sketch output when recursion is not used:

test1 (small tree)
    1 deinit Node root
    2 deinit Node left
    3 deinit Node left-a
    4 deinit Node left-b
    5 deinit Node right
    6 deinit Node right-a

test2 (long list)
b4 release
1000 deinit Node node
2000 deinit Node node
3000 deinit Node node
...
98000 deinit Node node
99000 deinit Node node
100000 deinit Node node
after release
done
Implementation
#include <stdio.h>
#include <assert.h>

#define var auto
#define let const auto
#define nil nullptr

let useRecursion = false;

struct NSObject;

typedef NSObject* id;

static var gap = 0;
static var printEnabled = true;

void printGap() {
    for (var i = 0; i < gap; i++) {
        printf("    ");
    }
}

struct String {
    char str[256];
};

struct NSObject {
    union {
        int retainCount;
        id link;
    } u;
    
    NSObject() {
        u.retainCount = 1;
    }
    
    virtual ~NSObject() {}
    
    id retain() {
        u.retainCount++;
        return this;
    }
    
    void release() {
        if (--u.retainCount == 0) {
            dealloc();
        }
    }
    
    int retainCount() {
        return u.retainCount;
    }
    
    virtual void deinit() {
        static var printCount = 0;
        
        printCount++;
        if (printEnabled || (printCount % 1000) == 0) {
            if (printEnabled) {
                printGap();
            }
            printf("%d deinit %s\n", printCount, description().str);
        }
    }
    
    virtual void dealloc() {
        gap++;
        if (useRecursion) {
            deallocRecursive();
        } else {
            deallocNonRecursive();
        }
        gap--;
    }
    virtual String description() = 0;
    virtual int objectCount() = 0;
    virtual id objectAtIndex(int index) = 0;

    void deallocRecursive() {
        deinit();
        
        for (var n = objectCount(), i = 0; i < n; i++) {
            if (let object = objectAtIndex(i)) {
                object->release();
            }
        }
        delete this;
    }
    
    void deallocNonRecursive() {
        u.link = nil;
        var deallocTop = this;
        
        while (deallocTop) {
            let cur = deallocTop;
            deallocTop = cur->u.link;
            
            cur->deinit();
            
            for (var i = cur->objectCount() - 1; i >= 0; i--) {
                if (let object = cur->objectAtIndex(i)) {
                    if (object->retainCount() > 1) {
                        object->release(); // won't recurse
                    } else {
                        // object is being deallocated. can now reuse retainCount storage for link
                        object->u.link = deallocTop;
                        deallocTop = object;
                    }
                }
            }
            
            delete cur;
        }
    }
};

// Example guinea pig user data structure

struct Node: NSObject {
    typedef NSObject super;
    const char* name;
    Node* left;
    Node* right;
    
    Node(const char* name, Node* left = nil, Node* right = nil) : name(name), left(left), right(right) {}
    
    virtual String description() override {
        String string;
        snprintf(string.str, sizeof(string.str), "Node %s", name);
        return string;
    }

    virtual int objectCount() override {
        return 2;
    }
    virtual id objectAtIndex(int i) override {
        return i == 0 ? left : right;
    }
};

void test() {
    printf("test1 (small tree)\n");
    let tree = new Node("root", new Node("left", new Node("left-a"), new Node("left-b")), new Node("right", new Node("right-a")));
    tree->release();
    printf("\n");
    
    printf("test2 (long list)\n");

    printEnabled = false;
    Node* first = nil;
    
    for (var i = 0; i < 100000; i++) {
        first = new Node("node", first);
    }
    printf("b4 release\n");
    first->release();
    printf("after release\n");
    printf("done\n");
}

int main() {
    test();
    return 0;
}

/*
 
Sketch output when recursion is used (mimics the current recursive algorithm found in Swift & Objective-C). Note indentation:

test1 (small tree)
    1 deinit Node root
        2 deinit Node left
            3 deinit Node left-a
            4 deinit Node left-b
        5 deinit Node right
            6 deinit Node right-a

test2 (long list)
b4 release
1000 deinit Node node
2000 deinit Node node
3000 deinit Node node
...
21000 deinit Node node
22000 deinit Node node
23000 deinit Node node

ThreadSanitizer:DEADLYSIGNAL
==67795==ERROR: ThreadSanitizer: SEGV on unknown address 0x60000108a000 (pc 0x0001001912b7 bp 0x7ffeef6f7250 sp 0x7ffeef6f7230 T2047778)
==67795==The signal is caused by a WRITE memory access.

Sketch output when recursion is not used:

test1 (small tree)
    1 deinit Node root
    2 deinit Node left
    3 deinit Node left-a
    4 deinit Node left-b
    5 deinit Node right
    6 deinit Node right-a

test2 (long list)
b4 release
1000 deinit Node node
2000 deinit Node node
3000 deinit Node node
...
98000 deinit Node node
99000 deinit Node node
100000 deinit Node node
after release
 
*/

Edit: fixed some bugs in the implementation and now reusing retainCount variable storage for the link variable.

1 Like

How would you generalize this to mutually recursive types?