Deep recursion in deinit should not happen

It should be possible to recursively destroy Swift object graphs without using up stack space or changing observable behavior, by reclaiming the freed space from objects as context for tracking the rest of the object graph to release. There was a paper describing this technique recently:

We have some ideas for how to apply this technique to Swift, but we haven't gotten around to implementing them yet.

29 Likes

The authors admit that their technique is not applicable in the presence of Rust’s bit-packing, which Swift also emulates:

We did not find a formal description of Rust’s optimised memory layout, but it is likely that the representation of some types must be reconsidered if one wants to support efficient destructors.

I also wonder whether their technique relies on total visibility into the types that make up the recursive structure. If so, their technique is possibly not applicable if mutually-recursive types are separated by a resilience boundary.

In Swift's case, we have the benefit that all refcounted objects have a consistent header layout, which gives us at least two guaranteed words of extra space we can cram tracking information into.

13 Likes
Same issue in Objective-C
#import <Foundation/Foundation.h>

@interface Chained: NSObject
@property Chained* next;
@end

@implementation Chained
- (void)dealloc {
    static int n = 0;
    if ((n % 1000) == 0) {
        NSLog(@"dealloc %d", n);
    }
    n += 1;
}
@end

void test(void) {
    NSLog(@"building...");
    Chained* first = [Chained new];
    Chained* last = first;
    int n = 100000; // 1000000 for release mode
    for (int i = 1; i < n; i++) {
        Chained* next = [Chained new];
        last.next = next;
        last = next;
    }
    last = nil;
    
    NSLog(@"forgetting references...");
    first = nil;
    
    /*
     Thread sanitizer, running outside Xcode
    2022-02-02 00:20:23.496 StackOverflow[15381:446319] dealloc 30000
    2022-02-02 00:20:24.229 StackOverflow[15381:446319] dealloc 31000
    2022-02-02 00:20:24.980 StackOverflow[15381:446319] dealloc 32000
    ThreadSanitizer:DEADLYSIGNAL
    ==15381==ERROR: ThreadSanitizer: stack-overflow on address 0x7ffee22edff8 (pc 0x00010d13a21c bp 0x7ffee22ee020 sp 0x7ffee22ee000 T446319)
    */
    
    /*
     No thread sanitizer:
     2022-02-02 00:21:45.604 StackOverflow[15425:447543] dealloc 38000
     2022-02-02 00:21:45.605 StackOverflow[15425:447543] dealloc 39000
     2022-02-02 00:21:45.605 StackOverflow[15425:447543] dealloc 40000
     Segmentation fault: 11
     */
    
    /*
     release mode:
     2022-02-02 00:22:29.130 StackOverflow[15478:448365] dealloc 45000
     2022-02-02 00:22:29.130 StackOverflow[15478:448365] dealloc 46000
     2022-02-02 00:22:29.130 StackOverflow[15478:448365] dealloc 47000
     Segmentation fault: 11
     */
    NSLog(@"done");
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        test();
    }
    return 0;
}
Ditto for the manual reference counted version (different number of deallocations until crash)
specify -fno-objc-arc for this file in build phases

#import <Foundation/Foundation.h>

@interface Chained: NSObject {
    @public
    Chained* next;
}
@end

@implementation Chained
- (void)dealloc {
    static int n = 0;
    if ((n % 1000) == 0) {
        NSLog(@"dealloc %d", n);
    }
    n += 1;
    Chained* next = self->next;
    [super dealloc];
    [next release];
}
@end

void test(void) {
    NSLog(@"building...");
    Chained* first = [Chained new];
    Chained* last = first;
    int n = 1000000;
    for (int i = 1; i < n; i++) {
        Chained* next = [Chained new];
        last->next = next;
        last = next;
    }
    last = nil;
    
    NSLog(@"forgetting references...");
    [first release];
    
    /*
     Thread sanitizer, running outside Xcode
     2022-02-02 00:38:59.498 StackOverflow[15633:461353] dealloc 67000
     2022-02-02 00:38:59.922 StackOverflow[15633:461353] dealloc 68000
     2022-02-02 00:39:00.358 StackOverflow[15633:461353] dealloc 69000
     2022-02-02 00:39:00.808 StackOverflow[15633:461353] dealloc 70000
     ThreadSanitizer:DEADLYSIGNAL
     ThreadSanitizer:DEADLYSIGNAL
     ThreadSanitizer: nested bug in the same thread, aborting.
    */
    
    /*
     No thread sanitizer:
     2022-02-02 00:40:15.630 StackOverflow[15722:463283] dealloc 126000
     2022-02-02 00:40:15.630 StackOverflow[15722:463283] dealloc 127000
     2022-02-02 00:40:15.630 StackOverflow[15722:463283] dealloc 128000
     2022-02-02 00:40:15.630 StackOverflow[15722:463283] dealloc 129000
     2022-02-02 00:40:15.631 StackOverflow[15722:463283] dealloc 130000
     Segmentation fault: 11
     */
    
    /*
     release mode:
     2022-02-02 00:40:47.407 StackOverflow[15765:463962] dealloc 171000
     2022-02-02 00:40:47.407 StackOverflow[15765:463962] dealloc 172000
     2022-02-02 00:40:47.408 StackOverflow[15765:463962] dealloc 173000
     2022-02-02 00:40:47.408 StackOverflow[15765:463962] dealloc 174000
     Segmentation fault: 11
     */
    NSLog(@"done");
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        test();
    }
    return 0;
}

I wonder how did we not notice it before.. No Objective-C user was using linked lists with 30-100K elements in the last 38 years? :thinking:

4 Likes

A similar example in C++ (excuse my C++-17):

#include <iostream>
#include <memory>
#include <optional>

struct Node {
    std::optional<std::unique_ptr<Node>> next;
};

int main(int argc, char *argv[]) {
    {
        std::optional<std::unique_ptr<Node>> root;
        for (auto i = 0; i < 1000000; ++i) {
            auto new_node = std::make_unique<Node>();
            new_node->next = std::move(root);
            root = std::optional{std::move(new_node)};
        }
    } // => Segmentation fault before ever hitting "Oops".
    
    std::cerr << "Oops" << std::endl;
}

This is a latent issue in most (all?) refcounting systems where manual care isn't taken to prevent long object chains.


Honestly, I'm myself a little surprised at the general surprise in this thread — in practice, it's pretty rare to have such a deep, acyclic object graph whose sole reference is a single root item such that deallocating the graph blows up the stack.

Non-intrusive linked lists are very rare in practice, and in many cases, once your object graph grows large enough, it's very common to:

  1. Have links across objects in the graph such that it's unlikely that taking down a single object will deallocate the whole graph at once, and/or the root of the graph is effectively immortal for the lifetime of the program, making destruction occur only at the end of execution lifetime (e.g. many apps)
  2. Have some form of entity management system which scales much better than direct object access (and in many cases, a side effect of such a system is a flattening of the object hierarchy, which prevents this in the first place)

At the risk of over-extending myself, I'd say that in practice, this is largely a non-issue for the vast majority of code out there. Obviously, it's not impossible to end up in a situation where manually overriding deinit is necessary, but it seems unlikely that this is a scenario worth optimizing for as a primary use-case.

(I'd love to see the ideas that @Joe_Groff has linked to explored and put into practice, but I don't feel like this might be an immediate, urgent issue.)

8 Likes

I have such a use case where I do need linked lists of many objects (linked lists because some items might get removed) and the whole structure might not be needed any more at one point. It might not occur often as a use case, but it certainly is a sensible use case, and I do not see any replacement for it.

I would maybe not argue about the urgency of this issue (just one comment about the urgency at the end of my comment), but this issue is certainly an important one. Let me justify:

  • The promise of the reference counting used by Swift is that it is a good replacement for garbage collection as implemented in Java and other languages if you just take care of possible cyclic references.
  • The use case described (at least with the large number of items) might not be a very common one, but it still a sensible, simple use case without an apparent alternative.
  • It is not clear in the described use case that a stack overflow could occur. You need to have knowledge about how reference counting is implemented in Swift to guess that something like this could happen. From the practitioner's view is is an unexpected behavior.
  • The two comments by @Joe_Groff above suggest that this problem it not inevitable. I seems that it can be resolved.

As a (maybe a little bold, and certainly a very personal) general statement, I think in a language that is started with such a large promise something like the described problem should not occur. This does not diminish how great this language is and how much very good and successful work has been put into it, but I really think that it is just not OK for Swift to have this issue.

One comment about the urgency (or the timing, if you like): ARC improvements according to this forums entry are on the way anyway, so maybe it is just the right time now?

1 Like

Apologies; I think my post came out unnecessarily abrupt. To clarify a little bit:

Absolutely! There are sensible use-cases that might hit this, and I'd be interested to hear more about your usage of linked lists in particular. (e.g., what access patterns for your data look like such that linked lists are the optimal choice for storage)

The Rust issue linked up-thread, for instance, doesn't add much backing information beyond the toy example, so real-world details add necessary nuance to a solution here.

I don't disagree with this statement at all — there was some commentary in this thread that made it sound (to me) as if this problem is both new, and possibly serious enough to need fix ASAP; my choice of words was aimed at that feeling, but maybe I misinterpreted others' wording.

In any case, I think there are definitely interesting developments in this area (e.g. the paper that Joe linked to is only from 2019, so it's not like this is a decades-old solved problem) that would be great to learn from (indeed, especially now if we're looking at other improvements to ARC).

4 Likes

I really hope this is not an inherent ARC limitation... Otherwise it can be a show stopper for ARC.

IMHO this is very serious. Imagine reference counting (¹) based web service that allows working with linked lists and allows hundred thousands of nodes. It is now obvious how malicious users can easily crash my service by merely creating and dropping a linked list of 50K elements. Can be very bad for swift publicity ("gee, our Java / Go / C# service can easily handle 1M nodes and swift service crashes after 50K!")

(¹) are there any precedents of ref count based languages that do not have this problem?

I agree in general that typically 30k element linked lists aren't the ideal data structure, and there are almost always better options. But for the sake of a somewhat "real world example" I first came across this issue when I was naively parsing Lisp in Rust. Everything seemed to work, until I came across a massive file that looked like (in pseudo-lisp)

(let x = 1 in (let y = 2 in ( let z = 3 in ....)))

Yeah, there are ways to get around this issue, but it makes the ADT more complicated and less representative of the AST that the Lisp represents.

I've dealt with this issue in ObjC (NSXMLDocument's internals) and C++ (WebKit's internals) before. My favorite suggestion in the ObjC case came from a former exec who pointed out that if I autoreleased, instead of released, in dealloc, it would "smear" the stack usage across the next N turns of the runloop, avoiding stack overflow. I didn't end up doing that, but it was a fun idea.

8 Likes

It has been; particularly folks do notice it with NSOperationQueue from time to time when they fall into the pattern of super-linear chains of dependencies (when they should just be using a dedicated maxConcurrentOperations = 1 queue with cross queue dependencies to managed the non linear graph of dependencies).

By-in-far it is not common however since most usage for things like linked lists are often better served with arrays.

3 Likes

Easy example when I would want to use linked lists instead of array: Let's say I'm implementing object reuse, and instead of deiniting / initialising objects from scratch I put deleted objects into a linked list, and when I need a new object I take it from there. In this case a linked list is better than array (unless of course we are fighting with the issue being discussed). Obviously the list can grow massively. Now I am deleting this list because, say, I am changing scenes in the app - oops.

In that case perhaps a Set<SomeSortOfHashableBoxByObjectIdentifier<T>> might be easier to use instead of a linked list.

I am not saying that it is a one size fits all answer, just often I find that going to a single or even doubly linked list makes things much more complicated at perhaps un-needed complexity.

4 Likes

No problem, no apologies necessary – I was myself quite "strict" in my response because I fear that an opportunity could get lost to make Swift even greater than it already is :slight_smile:

I have an XML library where I need direct access to e.g. elements of a certain name, without having to traverse the whole tree. The list is in the order of creation, which is perfect, and consists of the elements themselves pointing to the next one with the same name, and pointing back to the previous one weakly (for removal). But of course, any element might get removed or renamed, so I need to be able to remove an element from that list in an efficient way. And there can be many elements of the same name in a big document. And after processing the XML document it is not needed anymore, including all elements and those lists.

Great to hear this, thanks.

1 Like

One of my test cases for Graphing Calculator is a file with ---...---x with 16k minus signs, which parses to a tree 16k deep and simplifies to "x". The C++ version manages memory manually, iterating over nodes without recursion, and a YACC state machine parser for modest stack requirements. (The Swift version's hand written recursive descent parser overflows stack before it has a chance to stress deinit.) While I wouldn't classify this as a real-world example, it has been part of my test suite for many years.

5 Likes

Here's a sketch of possible non recursive deinit algorithm. The algorithm maintains a linked list of "to be deallocated" objects, the list is maintained via the "next" variable inside the object itself.

Initially deallocation list is empty. When an object with retainCount = 1 is released - it is pushed on top of deallocation list. Then the list is processed from top. We pop the top object from the list, we take all reference type properties of this object and look at them one by one right to left. If the property's retain count is greater than 1 we simply decrement it. If the property's retain count is 1 this object is pushed to deallocation list top. After we placed all object's properties on the deallocation list we start processing this list again, starting from top. We repeat this until deallocation list is empty.

If I am not mistaken deallocation order of this algorithm will match the current recursive deallocation order. If needed objects whose RC > 1 can also be released in the current left to right order, but that's probably not important.

As these are the objects being deallocated that are placed on the list, we can probably reuse 8 bytes of the object proper as the "next" variable, if that's possible the algorithm will not require additional per object memory. The issue left open - thread safety of the release list top variable.

Edit: if retain count is stored inside the object itself (is it now? I remember some global tables were used in the past) and if RC is pointer sized - RC itself can be reused as a pointer, as AFAIK there is no guarantee about RC of deallocated objects.

Edit 2: there's probably no issue with deallocation list top variable thread safety, as it can be maintained per "top level release call", even as a local stack variable. This algorithm internals won't call top level release. Separate top level invocations of release / deinit will effectively use different deallocation lists.

Forgive me for the tangent: I tested the ARC and MRC (manual reference counting) version of this simple app that creates and drops a linked list a few times (keeping the list length within limits of the current ARC implementation). It did this in obj-c for simplicity (I suppose I'd get similar results for swift implementation that uses unowned references and some external table to keep strong references to the nodes).

ARC app
// ARC app
#import <Foundation/Foundation.h>

@interface Chained: NSObject {
    @public Chained* next;
}
@end

@implementation Chained
@end

void test(void) {
    NSLog(@"start");
    double allocTime = 0;
    double deallocTime = 0;

    for (int i = 0; i < 10000; i++) {
        double start = CFAbsoluteTimeGetCurrent();
        Chained* first = nil;
        for (int i = 1; i < 30000; i++) {
            Chained* next = [Chained new];
            next->next = first;
            first = next;
        }
        allocTime += CFAbsoluteTimeGetCurrent() - start;
        
        start = CFAbsoluteTimeGetCurrent();
        first = nil;
        deallocTime += CFAbsoluteTimeGetCurrent() - start;
    }
    NSLog(@"allocTime: %f, deallocTime: %f, totalTime: %f", allocTime, deallocTime, allocTime + deallocTime);
    // allocTime: 21.279008, deallocTime: 32.589441, totalTime: 53.868450
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        test();
    }
    return 0;
}
MRC app
// MRC app
//specify -fno-objc-arc for this file in build phases

#import <Foundation/Foundation.h>

@interface Chained: NSObject {
    @public Chained* next;
}
@end

@implementation Chained
@end

void test(void) {
    NSLog(@"start");
    double allocTime = 0;
    double deallocTime = 0;

    for (int i = 0; i < 10000; i++) {
        double start = CFAbsoluteTimeGetCurrent();
        Chained* first = nil;
        for (int i = 1; i < 30000; i++) {
            Chained* next = [Chained new];
            next->next = first;
            first = next;
        }
        allocTime += CFAbsoluteTimeGetCurrent() - start;
        
        start = CFAbsoluteTimeGetCurrent();
        while (first) {
            Chained* cur = first;
            first = cur->next;
            [cur release];
        }
        deallocTime += CFAbsoluteTimeGetCurrent() - start;
    }
    NSLog(@"allocTime: %f, deallocTime: %f, totalTime: %f", allocTime, deallocTime, allocTime + deallocTime);
    // allocTime: 15.840458, deallocTime: 18.275354, totalTime: 34.115811
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        test();
    }
    return 0;
}

Release mode, Intel mac:
ARC: allocTime: 21.27, deallocTime: 32.58, totalTime: 53.86
MRC: allocTime: 15.84, deallocTime: 18.27, totalTime: 34.11

oops. No free lunch with ARC.

PS. There are lies, damned lies, and benchmarks.

That any automated garbage collection (ARC or the Java way) comes with some performance penalty should not be surprising. I would have expected the mentioned "recursion" of deinits to be much more expensive, but did not notice much of a runtime difference in my own examples (with or without the code mentioned by @lorentey above to avoid a possible stack overflow). I think we do not have a runtime performance problem here, ARC is doing quite well in this respect.

3 Likes

The presence of performance degradation itself is unsurprising of course. How big it is - somewhat is; I'd assume the task at hand would be totally dominated by allocation/deallocation costs, adding, perhaps, 5% for reference count management overhead [just imagine a different test that doesn't involve allocation/deallocation, the ARC overhead would be very significant], in case objects are born, accessed and killed on a single thread. To be fair I need to replicate this test in swift and on ARM: both could have relevant optimisation improvements.

How do I mark objects immortal in Swift / Obj-C? Is it possible to do in in macOS/iOS app using standard unmodified Xcode toolchain?

Recursion itself is cheap: a function call, no dynamic memory allocation, etc. (I once saw a system where recursive calls caused memory allocation but that was an exception, normally it's not done this way.)

Any feedback on the sketch of algorithm outlined in post #23, can it fly?

Added manual memory management "swift unsafe" and C version:

Swift ARC
class StrongNode {
    var next: StrongNode? = nil
}

print("start")

var allocTime = 0.0
var deallocTime = 0.0

for _ in 0 ..< 10_000 {
    let start = CFAbsoluteTimeGetCurrent()
    var first: StrongNode? = nil
    for _ in 1...30_000 {
        let next = StrongNode()
        next.next = first
        first = next
    }
    allocTime += CFAbsoluteTimeGetCurrent() - start

    func deallocList() {
        first = nil
    }
    let start2 = CFAbsoluteTimeGetCurrent()
    deallocList()
    deallocTime += CFAbsoluteTimeGetCurrent() - start2
}

print("allocTime: \(allocTime), deallocTime: \(deallocTime), totalTime: \(allocTime + deallocTime)")
// allocTime: 20.429471373558044, deallocTime: 26.36126732826233, totalTime: 46.79073870182037
print("done")
Swift Unsafe
typealias NodePtr = UnsafeMutablePointer<Node>

struct Node {
    var next: NodePtr

    static func newNode() -> NodePtr {
        let node = calloc(MemoryLayout<Node>.size, 1)!.assumingMemoryBound(to: Node.self)
        return node
    }
    static func deleteNode(_ node: NodePtr) {
        free(node)
    }
}

print("start")

var allocTime = 0.0
var deallocTime = 0.0

for _ in 0 ..< 10_000 {
    let start = CFAbsoluteTimeGetCurrent()
    let firstNode = Node.newNode()
    var first = firstNode
    for i in 1...30_000 {
        let next = Node.newNode()
        next.pointee.next = first
        first = next
    }
    allocTime += CFAbsoluteTimeGetCurrent() - start

    let start2 = CFAbsoluteTimeGetCurrent()
    while first != firstNode {
        let cur = first
        first = cur.pointee.next
        Node.deleteNode(cur)
    }
    deallocTime += CFAbsoluteTimeGetCurrent() - start2
}

print("allocTime: \(allocTime), deallocTime: \(deallocTime), totalTime: \(allocTime + deallocTime)")
// allocTime: 14.85484755039215, deallocTime: 14.128059029579163, totalTime: 28.982906579971313

print("done")
C version
#include <stdio.h>
#include <CoreFoundation/CoreFoundation.h>

typedef struct NodeRec* Node;

struct NodeRec {
    Node next;
};

Node newNode(void) {
    Node node = malloc(sizeof(Node));
    node->next = NULL;
    return node;
}

void deleteNode(Node node) {
    free(node);
}

void test(void) {
    printf("start\n");

    double allocTime = 0.0;
    double deallocTime = 0.0;

    for (int i = 0; i < 10000; i++) {
        double start = CFAbsoluteTimeGetCurrent();
        Node first = NULL;
        for (int i = 0; i < 30000; i++) {
            Node next = newNode();
            next->next = first;
            first = next;
        }
        allocTime += CFAbsoluteTimeGetCurrent() - start;

        double start2 = CFAbsoluteTimeGetCurrent();
        while (first) {
            Node cur = first;
            first = cur->next;
            deleteNode(cur);
        }
        deallocTime += CFAbsoluteTimeGetCurrent() - start2;
    }

    printf("allocTime: %f, deallocTime: %f, totalTime: %f\n", allocTime, deallocTime, allocTime + deallocTime);
    // allocTime: 13.667640, deallocTime: 13.962520, totalTime: 27.630160

    printf("done\n");
}

int main(int argc, const char * argv[]) {
    test();
    return 0;
}

Obj-C ARC: allocTime: 21.27, deallocTime: 32.58, totalTime: 53.86
Obj-C MRC: allocTime: 15.84, deallocTime: 18.27, totalTime: 34.11
Swift ARC: allocTime: 20.43, deallocTime: 26.36, totalTime: 46.79
Swift Unsafe (*): allocTime: 14.85, deallocTime: 14.12, totalTime: 28.98
C version: allocTime: 13.66, deallocTime: 13.96, totalTime: 27.63

(*) - this uses raw memory data structures, unsafe pointers + malloc/free, quite different version compared to others.

(release mode, intel mac)

Edit: fixed bugs above and added "Swift unsafe" manual memory management version for completeness.

Edit 2: added C version.

1 Like