I am considering a new representation for Swift refcounts and other per-object data. This is an outline of the scheme. Comments and suggestions welcome.
Today, each object stores 64 bits of refcounts and flags after the isa field.
In this new system, each object would store a pointer-size field after the isa field. This field would have two cases: it could store refcounts and flags, or it could store a pointer to a side allocation that would store refcounts and flags and additional per-object data.
Advantages:
* Saves 4 bytes per object on 32-bit for most objects.
* Improves refcount overflow and underflow detection.
* Might allow an inlineable retain/release fast path in the future.
* Allows a new weak reference implementation that doesn't need to keep entire dead objects alive.
* Allows inexpensive per-object storage for future features like associated references or class extensions with instance variables.
Disadvantages:
* Basic RR operations might be slower on x86_64. This needs to be measured. ARM architectures are probably unchanged.
···
----
The most significant bit (MSB) would distinguish between the fastest-path in-object retain/release and everything else. Objects that use some other RR path would have that bit set. This would include objects whose refcount is stored in the side allocation and objects whose refcount does not change because they are allocated on the stack or in read-only memory.
The MSB also becomes set if you increment or decrement a retain count too far. That means we can implement the RR fast path with a single conditional branch after the increment or decrement:
retain:
    intptr_t oldRC = obj->rc
    newRC = oldRC + RC_ONE  // sets MSB on overflow; MSB already set for other special cases
    if (newRC >= 0) {
        CAS(obj->rc = oldRC => newRC)  // retry from the load if the CAS fails
    } else {
        call slow path
        // out-of-object refcount (MSB bits 0b10x)
        // or refcount has overflowed (MSB bits 0b111)
        // or refcount is constant (MSB bits 0b110)
    }
release:
    intptr_t oldRC = obj->rc
    newRC = oldRC - RC_ONE  // sets MSB on underflow; MSB already set for other special cases
    if (newRC >= 0) {
        CAS(obj->rc = oldRC => newRC)  // retry from the load if the CAS fails
    } else {
        call slow path
        // dealloc (MSB bits 0b111)
        // or out-of-object refcount (MSB bits 0b10x)
        // or refcount has underflowed (MSB bits 0b111 and deallocating bit already set)
        // or refcount is constant (MSB bits 0b110)
    }
There are some fussy bit representation details here to make sure that a pre-existing MSB=1 does not become 0 after an increment or decrement.
(In the more distant future this fast path could be inlineable while preserving ABI flexibility: if worst comes to worst we can set the MSB all the time and force inliners to fall back to the slow path runtime function. We don't want to do this yet though.)
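As a rough illustration, the pseudocode above might look like this with C11 atomics. This is only a sketch under assumptions: the `Object` layout and the choice of memory orders are hypothetical, and `RC_ONE` is simply the increment used in the assembly later in this message. It does not attempt the fussy bit details that keep a pre-existing MSB=1 from becoming 0.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Increment used by the assembly in this message. The layout of the
// bits below it is not specified here.
#define RC_ONE INT64_C(0x200000000)

// Hypothetical object header: isa followed by one refcount word.
typedef struct {
    void *isa;
    _Atomic int64_t rc;
} Object;

// Returns true if the fast path handled the retain.
// Returns false, leaving obj->rc untouched, if the slow path is needed.
static bool retain_fast(Object *obj) {
    assert(obj != NULL);
    int64_t oldRC = atomic_load_explicit(&obj->rc, memory_order_relaxed);
    for (;;) {
        // Add in unsigned arithmetic so that a carry into the MSB is
        // well-defined C; the result is then reinterpreted as signed.
        int64_t newRC = (int64_t)((uint64_t)oldRC + (uint64_t)RC_ONE);
        if (newRC < 0) return false;  // MSB set: slow path
        if (atomic_compare_exchange_weak_explicit(
                &obj->rc, &oldRC, newRC,
                memory_order_relaxed, memory_order_relaxed))
            return true;
        // CAS failed: oldRC now holds the current value, so try again.
    }
}

// Same shape as retain_fast, but subtracts and publishes with release order.
static bool release_fast(Object *obj) {
    assert(obj != NULL);
    int64_t oldRC = atomic_load_explicit(&obj->rc, memory_order_relaxed);
    for (;;) {
        int64_t newRC = (int64_t)((uint64_t)oldRC - (uint64_t)RC_ONE);
        if (newRC < 0) return false;  // dealloc or another special case
        if (atomic_compare_exchange_weak_explicit(
                &obj->rc, &oldRC, newRC,
                memory_order_release, memory_order_relaxed))
            return true;
    }
}
```

A caller would try the fast function first and call the out-of-line slow path only when it returns false, which is the single conditional branch described above.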
The side allocation could be used for:
* New weak reference implementation that doesn't need to keep entire dead objects alive.
* Associated references or class extensions with instance variables
* Full-size strong refcount and unowned refcount on 32-bit architectures
* Future concurrency data or debugging instrumentation data
The Objective-C runtime uses a side table for similar purposes. It has the disadvantage that retrieving an object's side allocation requires use of a global hash table, which is slow and requires locking. This scheme would be faster and contention-free.
Installing a side allocation on an object would be a one-way operation for thread-safety reasons. For example, an object might be given a side allocation when it is first weakly referenced, but the object would not go back to in-object refcounts if the weak reference went away. Most objects would not need a side allocation.
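A minimal sketch of the one-way installation, assuming C11 atomics. Everything named here is hypothetical: the `SIDE_TAG` bit, the encoding of the pointer as the address shifted right by 3, and the `Side` fields. The point it shows is that a single compare-and-swap moves the word from inline refcounts to a side pointer, and nothing ever moves it back.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical encoding: MSB set means the rest of the word is a side
// allocation address shifted right by 3 (8-byte alignment assumed).
#define SIDE_TAG (UINT64_C(1) << 63)

typedef struct {
    void *object;          // back pointer to the object
    uint64_t inline_bits;  // refcounts and flags copied out of the object
} Side;

typedef struct {
    void *isa;
    _Atomic uint64_t rc;   // inline refcounts, or a tagged side pointer
} Object;

// Returns the object's side allocation, installing one if needed.
// Returns NULL only if memory allocation fails.
static Side *get_or_install_side(Object *obj) {
    uint64_t word = atomic_load(&obj->rc);
    Side *fresh = NULL;
    for (;;) {
        if (word & SIDE_TAG) {
            // Already installed, possibly by another thread that won the
            // race. free(NULL) is a no-op if we never allocated.
            free(fresh);
            return (Side *)(uintptr_t)((word & ~SIDE_TAG) << 3);
        }
        if (fresh == NULL) {
            fresh = malloc(sizeof *fresh);
            if (fresh == NULL) return NULL;
            assert(((uintptr_t)fresh & 7) == 0);
            fresh->object = obj;
        }
        fresh->inline_bits = word;  // snapshot of the in-object state
        uint64_t tagged = SIDE_TAG | ((uint64_t)(uintptr_t)fresh >> 3);
        if (atomic_compare_exchange_weak(&obj->rc, &word, tagged))
            return fresh;
        // CAS failed: word holds the latest value, so re-examine it.
    }
}
```

Because an in-object retain or release also changes the word with a CAS, a concurrent refcount change simply makes the install CAS fail and retry with the new value.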
----
Weak references could be implemented using the side allocation. A weak variable would point to the object's side allocation. The side allocation would store a pointer to the object, a strong refcount, and a weak refcount. (This weak refcount would be distinct from the unowned refcount.) The weak refcount would be incremented once for every weak variable holding this object.
The advantage of using a side allocation for weak references is that the storage for a weakly-referenced object could be freed synchronously when deinit completes. Only the small side allocation would remain, backing the weak variables until they are cleared on their next access. This is a memory improvement over today's scheme, which keeps the object's entire storage alive for a potentially long time.
The hierarchy:
Strong refcount goes to zero: deinit
Unowned refcount goes to zero: free the object
Weak refcount goes to zero: free the side allocation
When a weakly-referenced object is destroyed, it would free its own storage but leave the side allocation alive until all of the weak references go away.
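The hierarchy can be sketched in C as three counters that cascade. How the levels are chained is an assumption made here for illustration: the strong references collectively hold one unowned count, and the object's storage holds one weak count. The `Side` layout and function names are hypothetical, and the deinit call is reduced to setting a flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    void *object;         // object storage; NULL once it has been freed
    atomic_uint strong;   // number of strong references
    atomic_uint unowned;  // unowned references, +1 for all strong refs
    atomic_uint weak;     // weak variables, +1 for the object storage
    bool deinited;        // stands in for "deinit has run"
} Side;

// Creates an object with one strong reference and its side allocation.
static Side *side_create(size_t object_size) {
    Side *s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->object = malloc(object_size);
    atomic_init(&s->strong, 1);
    atomic_init(&s->unowned, 1);
    atomic_init(&s->weak, 1);
    s->deinited = false;
    return s;
}

// Each release returns true if the side allocation itself was freed.

// Weak refcount goes to zero: free the side allocation.
static bool weak_release(Side *s) {
    assert(s != NULL);
    if (atomic_fetch_sub(&s->weak, 1) == 1) {
        free(s);
        return true;
    }
    return false;
}

// Unowned refcount goes to zero: free the object.
static bool unowned_release(Side *s) {
    assert(s != NULL);
    if (atomic_fetch_sub(&s->unowned, 1) == 1) {
        free(s->object);
        s->object = NULL;
        return weak_release(s);  // drop the storage's weak count
    }
    return false;
}

// Strong refcount goes to zero: deinit.
static bool strong_release(Side *s) {
    assert(s != NULL);
    if (atomic_fetch_sub(&s->strong, 1) == 1) {
        s->deinited = true;         // deinit would run here
        return unowned_release(s);  // drop the strong refs' unowned count
    }
    return false;
}
```

With one weak variable outstanding, the last strong release runs deinit and frees the object storage, but the side allocation survives until that weak variable lets go.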
When a weak variable is read, it would go to the side allocation first and atomically increment the strong refcount if the deallocating bit were not set. Then it would return the object pointer stored in the side allocation. If the deallocating bit were set, it would instead atomically decrement the weak refcount and free the side allocation if the count reached zero. (There is another race here that probably requires separate side bits for object-is-deallocating and object-is-deallocated.)
When an old value is erased from a weak variable, it would atomically decrement the weak refcount in the side allocation and free the side allocation if it reaches zero.
When a new value is stored to a weak variable, it would install a side allocation if necessary, then check the deallocating bit in the side allocation. If the object is not deallocating it would atomically increment the weak refcount.
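The three weak variable operations above could be sketched as follows. The layout is an assumption: the deallocating flag is placed in the same atomic word as the strong count so that "increment unless deallocating" is a single CAS, and `DEALLOCATING`, `STRONG_ONE`, and the `Side` fields are hypothetical. The sketch does not make the weak variable itself safe for concurrent access, and it does not address the deallocating/deallocated race mentioned above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEALLOCATING UINT64_C(1)  // hypothetical flag position
#define STRONG_ONE   UINT64_C(2)  // hypothetical: count sits above the flag

typedef struct {
    void *object;             // pointer back to the object
    _Atomic uint64_t strong;  // strong count plus the deallocating flag
    _Atomic uint64_t weak;    // one count per weak variable
} Side;

// Drops one weak count. Returns true if the side allocation was freed.
static bool weak_release(Side *side) {
    assert(side != NULL);
    if (atomic_fetch_sub(&side->weak, 1) == 1) {
        free(side);
        return true;
    }
    return false;
}

// Stores a new value (or NULL) and erases the old one.
static void weak_store(Side **var, Side *newSide) {
    Side *old = *var;
    *var = NULL;
    if (newSide != NULL &&
        !(atomic_load(&newSide->strong) & DEALLOCATING)) {
        atomic_fetch_add(&newSide->weak, 1);
        *var = newSide;
    }
    if (old != NULL) weak_release(old);
}

// Reads the variable. On success the caller owns a new strong reference.
static void *weak_load(Side **var) {
    Side *side = *var;
    if (side == NULL) return NULL;
    uint64_t old = atomic_load(&side->strong);
    while (!(old & DEALLOCATING)) {
        if (atomic_compare_exchange_weak(&side->strong, &old,
                                         old + STRONG_ONE))
            return side->object;
        // CAS failed: old was reloaded, so re-check the flag.
    }
    // Object is going away: clear the variable and drop its weak count.
    *var = NULL;
    weak_release(side);
    return NULL;
}
```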
----
RR fast paths in untested x86_64 assembly (AT&T syntax, destination on the right):
retain_fast:
    // object in %rdi
    mov     8(%rdi), %rax
    movabs  $0x200000000, %rcx      // RC_ONE does not fit in an add immediate
1:  mov     %rax, %rdx
    add     %rcx, %rdx
    js      retain_slow
    lock cmpxchg %rdx, 8(%rdi)
    jne     1b
release_fast:
    // object in %rdi
    mov     8(%rdi), %rax
    movabs  $0x200000000, %rcx      // RC_ONE does not fit in a sub immediate
1:  mov     %rax, %rdx
    sub     %rcx, %rdx
    js      release_slow
    lock cmpxchg %rdx, 8(%rdi)
    jne     1b
RR fast paths in untested arm64 assembly:
retain_fast:
    // object in x0
    add     x1, x0, #8
1:  ldxr    x2, [x1]
    mov     x3, #0x200000000
    adds    x2, x2, x3
    b.mi    retain_slow
    stxr    w4, x2, [x1]
    cbnz    w4, 1b                  // nonzero status: store failed, retry
release_fast:
    // object in x0
    add     x1, x0, #8
1:  ldxr    x2, [x1]
    mov     x3, #0x200000000
    subs    x2, x2, x3
    b.mi    release_slow
    stlxr   w4, x2, [x1]
    cbnz    w4, 1b                  // nonzero status: store failed, retry
--
Greg Parker gparker@apple.com Runtime Wrangler