Implementing Mark-Sweep Garbage Collection in Swift

Alex_Popov · November 21, 2018, 4:17am

Hello all,

I'm not a Swift beginner, but this is my first time attempting to hack on the Swift compiler.

After reading a paper on improvements to reference counting (e.g. automatic cycle detection and coalescing of reference mutations) I became curious whether or not some of these improvements could be brought to Swift. The problem is, this paper requires a garbage collector.

My plan is to at least prototype a rudimentary mark-sweep garbage collector that runs along-side reference counting. The way I see it, I must implement the following:

Find referenced classes.

This means synthesizing a reachablePointers method for all objects, which returns the set of pointers this object references.
I don't expect this is too hard as its enumerating all class/struct properties and returning what they point to, but I'd be happy to know where to even start reading the compiler source.

Add a Mark bit to the object header of all classes.

I expect structs don't need a Mark bit.

Somehow find the root set.

This feels like a problem that I can solve by reading about SIL

Implement Mark-Sweep.

I think I could just do this in Swift, using Unmanaged to free unmarked memory and reachablePointers to recurse.
Bonus that if I extend steps 1-3 to Objective-C, Swift could handle that as well.

I want to stress that this is not a Swift Evolution post or a serious improvement to Swift itself. I posted it in "Using Swift" to not clog the very important channels. This is a research project.

I would appreciate any feedback other than "don't do it". Even RTFM would be great if it links to someplace helpful. Thank you all in advance.

Michael_Gottesman · November 21, 2018, 5:18am

COW (Copy on Write) is fundamental to Swift’s value type implementation. This AFAIK can not be implemented with a garbage collector.

EDIT: To be more specific... unless you can come up with a way to deal with COW data structures or have the mark/sweep garbage collector not apply to types that rely on COW, I do not see how you can do this without changing the semantics in subtle ways. Another issue with that is that today there isn't such a thing as a COWable protocol, so I am unsure of how you would identify such data structures in a robust way.

Alex_Popov · November 21, 2018, 6:46am

Structs cannot (AFAIK) form a reference cycle (though they may be part of one), so they do not need to be handled by the garbage collector since RC is complete over structs. Unless I’m misunderstanding then, that shouldn’t be an issue then as long as I don’t attempt to sweep them. Is anything other than structs and enums COWable?

COW is a very important point to keep in mind, I hadn’t considered its ramifications at all. Thanks!

TellowKrinkle · November 21, 2018, 7:33am

As long as you're planning on leaving the current reference counting system in, nothing should break. The main thing that COW needs is for isUniquelyReferenced to work properly, which currently relies on checking whether an object's reference count is 1 or >1. If you were planning on reducing the amount reference counting is used, that could break things.

Chris_Lattner3 · November 24, 2018, 6:11am

FWIW, I'd love to see GC implementations of swift be experimented with!

That said, I agree with Michael - I think we should start by formalizing the model for isUniquelyReferenced and COW. This could be done by dusting off some of the old COW proposals.

Another much simpler and more narrow proposal would be to implement a new @supportsIsUniquelyReferenced (better name welcome) attribute. Classes that adopt it can be used with isUniquelyReferenced, and classes without it would not.

In the case of a GC-based implementation of Swift, this would mean that classes with that attribute would have to carry a refcount, but nothing else would.

Alex_Popov · November 25, 2018, 4:21am

Thanks for the words of encouragement! This gives me a direction in which to start reading and learning, thank you.

Joe_Groff · November 25, 2018, 6:08am

Another avenue to explore would be something like Python’s implementation (at least historically), where reference counting is the primary memory management mechanism and a simple GC exists only to collect cycles. That would maintain many of the desirable properties of ARC for non-cyclic data while defining away the problems with leaks.

Aside from uniquely-referenced checks, the other thing to consider is non-trivial deinits. Since Swift with ARC provides an guaranteed upper bound to when objects will be deallocated, deinitializers can be more reliably used for managing non-memory resources, whereas finalizers in GC languages are notoriously unreliable and also pose semantic challenges for optimizers and static analyzers. If you were going to implement Swift with a more sophisticated GC as its primary memory management mechanism, I think you’d want to keep reference counting for objects with nontrivial deinits, not only objects that need to be checked for uniqueness. (An implementation with a cycle-only collector might want to emit a runtime warning when such an object is collected as part of a cycle too.)

Chris_Lattner3 · November 29, 2018, 4:38am

Agreed, a 'reference counted system with a cycle collector' would be a better way to preserve our deinit semantics, at the cost of preserving the cost of reference counting.

A mark/sweep GC with refcounting used on an opt-in basis would be a better way to get higher performance (for the cases when RC is getting you down) at the cost of losing deinit semantics.

Our long term bet is that a proper ownership model and continued improvements to ARC optimization will greatly reduce the remaining refcount overhead and make the performance side of this moot. The cycle issue will still be a real thing.

If you're interested in cycles, one of the things that would be really interesting to investigate is the development of a "Debug time only" cycle collector with the goal of detecting and reporting cycles to the app developer. Such a tool can optimize for keeping additional metadata around in important case (e.g. when capturing self) to make the reports particularly actionable and directed.

-Chris

Alex_Popov · November 29, 2018, 4:48am

I've limited the scope of my current project to runtime cycle detection through the use of leaks. I'm stuck on how the .memgraph files encode nodes and edges, but hopefully I can figure that out and make an interim tool for reporting on cycles roughly as they happen (with a huge performance penalty).

Afterwards I do want to build just the Cycle Detector you're describing. The point about maintaining additional metadata is interesting, I hadn't really considered the cases where it would be most impactful.

Thanks as always for your insight Chris.