[Discussion] New refcount representation


(Greg Parker) #1

I am considering a new representation for Swift refcounts and other per-object data. This is an outline of the scheme. Comments and suggestions welcome.

Today, each object stores 64 bits of refcounts and flags after the isa field.

In this new system, each object would store a pointer-size field after the isa field. This field would have two cases: it could store refcounts and flags, or it could store a pointer to a side allocation that would store refcounts and flags and additional per-object data.

Advantages:
* Saves 4 bytes per object on 32-bit for most objects.
* Improves refcount overflow and underflow detection.
* Might allow an inlineable retain/release fast path in the future.
* Allows a new weak reference implementation that doesn't need to keep entire dead objects alive.
* Allows inexpensive per-object storage for future features like associated references or class extensions with instance variables.

Disadvantages:
* Basic RR operations might be slower on x86_64. This needs to be measured. ARM architectures are probably unchanged.

···

----

The MSB would distinguish between the fastest-path in-object retain/release and everything else. Objects that use some other RR path would have that bit set. This would include objects whose refcount is stored in the side allocation and objects whose refcount does not change because they are allocated on the stack or in read-only memory.

The MSB also becomes set if you increment or decrement a retain count too far. That means we can implement the RR fast path with a single conditional branch after the increment or decrement:

retain:
    intptr_t oldRC = obj->rc
    newRC = oldRC + RC_ONE // sets MSB on overflow; MSB already set for other special cases
    if (newRC >= 0) {
        CAS(obj->rc = oldRC => newRC)
    } else {
        call slow path
        // out-of-object refcount (MSB bits 0b10x)
        // or refcount has overflowed (MSB bits 0b111)
        // or refcount is constant (MSB bits 0b110)
    }

release:
    intptr_t oldRC = obj->rc
    newRC = oldRC - RC_ONE // sets MSB on underflow; MSB already set for other special cases
    if (newRC >= 0) {
        CAS(obj->rc = oldRC => newRC)
    } else {
        call slow path
        // dealloc (MSB bits 0b111)
        // or out-of-object refcount (MSB bits 0b10x)
        // or refcount has underflowed (MSB bits 0b111 and deallocating bit already set)
        // or refcount is constant (MSB bits 0b110)
    }

There are some fussy bit representation details here to make sure that a pre-existing MSB=1 does not become 0 after an increment or decrement.

(In the more distant future this fast path could be inlineable while preserving ABI flexibility: if worst comes to worst we can set the MSB all the time and force inliners to fall back to the slow-path runtime function. We don't want to do this yet, though.)
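As a sketch, the fast path above might look like this in C11 atomics. RC_ONE, the field name, and the bias (a stored count of 0 meaning one strong reference) are all illustrative, not the actual runtime layout:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical sketch of the single-branch fast path. The stored count is
// biased so that releasing the last reference underflows into the MSB.
#define RC_ONE ((intptr_t)1 << 33)

typedef struct {
    _Atomic intptr_t rc;  // refcount-and-flags word after the isa field
} HeapObject;

// Returns true if the fast path handled the retain; false means the caller
// must take the slow path (side allocation, overflow, or constant refcount).
static bool swift_retain_fast(HeapObject *obj) {
    intptr_t oldRC = atomic_load_explicit(&obj->rc, memory_order_relaxed);
    for (;;) {
        intptr_t newRC = oldRC + RC_ONE;  // overflow carries into the MSB
        if (newRC < 0) return false;      // MSB set: some non-fast case
        if (atomic_compare_exchange_weak_explicit(&obj->rc, &oldRC, newRC,
                memory_order_relaxed, memory_order_relaxed))
            return true;
        // CAS failure refreshed oldRC; retry with the current value.
    }
}

static bool swift_release_fast(HeapObject *obj) {
    intptr_t oldRC = atomic_load_explicit(&obj->rc, memory_order_relaxed);
    for (;;) {
        intptr_t newRC = oldRC - RC_ONE;  // releasing the last reference
        if (newRC < 0) return false;      // underflows into the slow path
        if (atomic_compare_exchange_weak_explicit(&obj->rc, &oldRC, newRC,
                memory_order_relaxed, memory_order_relaxed))
            return true;
    }
}
```

With the count biased this way, deallocation is just another slow-path case: releasing the last reference drives the word negative, like the 0b111 case in the pseudocode above.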

The side allocation could be used for:
* New weak reference implementation that doesn't need to keep entire dead objects alive.
* Associated references or class extensions with instance variables
* Full-size strong refcount and unowned refcount on 32-bit architectures
* Future concurrency data or debugging instrumentation data

The Objective-C runtime uses a side table for similar purposes. It has the disadvantage that retrieving an object's side allocation requires use of a global hash table, which is slow and requires locking. This scheme would be faster and contention-free.

Installing a side allocation on an object would be a one-way operation for thread-safety reasons. For example, an object might be given a side allocation when it is first weakly referenced, but the object would not go back to in-object refcounts if the weak reference went away. Most objects would not need a side allocation.
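A hypothetical sketch of that one-way install with a compare-and-swap. The single MSB tag and the structure layout here are invented for illustration; the real scheme would pack the pointer differently (see the 0b10x encoding above):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

// Sketch of the one-way install: the first thread to CAS the refcount word
// from its inline form to a tagged side-allocation pointer wins; losers
// free their speculative allocation and adopt the winner's.
#define SIDE_FLAG ((uintptr_t)1 << 63)

typedef struct SideAllocation {
    void *object;
    uint32_t strongRefCount, unownedRefCount, weakRefCount;
} SideAllocation;

typedef struct {
    _Atomic uintptr_t rc;
} HeapObject;

static SideAllocation *install_side_allocation(HeapObject *obj) {
    uintptr_t bits = atomic_load(&obj->rc);
    if (bits & SIDE_FLAG)                       // already installed
        return (SideAllocation *)(bits & ~SIDE_FLAG);

    SideAllocation *side = calloc(1, sizeof(SideAllocation));
    side->object = obj;
    // ...copy the inline refcounts out of `bits` into `side` here...
    while (!atomic_compare_exchange_weak(&obj->rc, &bits,
                                         (uintptr_t)side | SIDE_FLAG)) {
        if (bits & SIDE_FLAG) {                 // another thread won the race
            free(side);
            return (SideAllocation *)(bits & ~SIDE_FLAG);
        }
        // Inline refcount changed concurrently; a real runtime would
        // re-copy the refcounts before retrying the CAS.
    }
    return side;
}
```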

----

Weak references could be implemented using the side allocation. A weak variable would point to the object's side allocation. The side allocation would store a pointer to the object and a strong refcount and a weak refcount. (This weak refcount would be distinct from the unowned refcount.) The weak refcount would be incremented once for every weak variable holding this object.

The advantage of using a side allocation for weak references is that the storage for a weakly-referenced object could be freed synchronously when deinit completes. Only the small side allocation would remain, backing the weak variables until they are cleared on their next access. This is a memory improvement over today's scheme, which keeps the object's entire storage alive for a potentially long time.

The hierarchy:
  Strong refcount goes to zero: deinit
  Unowned refcount goes to zero: free the object
  Weak refcount goes to zero: free the side allocation

When a weakly-referenced object is destroyed, it would free its own storage but leave the side allocation alive until all of the weak references go away.
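One possible model of that hierarchy, sketched non-atomically with invented names. It assumes the live object itself holds one unowned and one weak reference of its own, so the side allocation outlives the object's storage:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

// Illustrative (non-atomic) model of the teardown hierarchy.
typedef struct Side {
    void *object;               // NULL once the object's storage is freed
    uint32_t strong, unowned, weak;
    bool deallocating;
} Side;

static void side_release_strong(Side *s) {
    if (--s->strong == 0) {
        s->deallocating = true;
        // deinit runs here; then the dying object drops its own references:
        if (--s->unowned == 0) {
            free(s->object);    // object storage is freed synchronously
            s->object = NULL;
        }
        if (--s->weak == 0)
            free(s);            // no weak variables left: side goes too
    }
}
```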

When a weak variable is read, it would go to the side allocation first and atomically increment the strong refcount if the deallocating bit were not set. Then it would return the object pointer stored in the side allocation. If the deallocating bit were set, it would atomically decrement the weak refcount and free the side allocation if it reached zero. (There is another race here that probably requires separate side bits for object-is-deallocating and object-is-deallocated.)

When an old value is erased from a weak variable, it would atomically decrement the weak refcount in the side allocation and free the side allocation if it reaches zero.

When a new value is stored to a weak variable, the store would install a side allocation if necessary, then check the deallocating bit in the side allocation. If the object is not deallocating, it would atomically increment the weak refcount.
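A sketch of the weak-read path described above, in C11 atomics. The flag encoding and names are invented, and the deallocating/deallocated race noted above is ignored here:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical weak-read: try to bump the strong count unless the object
// is deallocating; otherwise clear the variable and drop its weak count.
#define DEALLOCATING 1u
#define STRONG_ONE   2u   // strong count lives above the flag bit

typedef struct WeakSide {
    void *object;
    _Atomic uint32_t strong;
    _Atomic uint32_t weak;
} WeakSide;

// Returns the object with a fresh +1 strong reference, or NULL if it died.
static void *weak_load(WeakSide **var) {
    WeakSide *side = *var;
    if (!side) return NULL;
    uint32_t s = atomic_load(&side->strong);
    while (!(s & DEALLOCATING)) {
        if (atomic_compare_exchange_weak(&side->strong, &s, s + STRONG_ONE))
            return side->object;              // retained while still alive
    }
    // Object is dying: clear the variable and drop its weak reference.
    *var = NULL;
    if (atomic_fetch_sub(&side->weak, 1) == 1)
        free(side);                           // last weak reference gone
    return NULL;
}
```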

----

RR fast paths in untested x86_64 assembly (AT&T syntax, destination on the right):

retain_fast:
   // object in %rdi
   mov $0x200000000, %rcx   // RC_ONE is too wide for an imm32
   mov 8(%rdi), %rax
1: mov %rax, %rdx
   add %rcx, %rdx
   js retain_slow           // sign set: slow path
   lock cmpxchg %rdx, 8(%rdi)
   jne 1b                   // %rax was reloaded; retry

release_fast:
   // object in %rdi
   mov $0x200000000, %rcx
   mov 8(%rdi), %rax
1: mov %rax, %rdx
   sub %rcx, %rdx
   js release_slow
   lock cmpxchg %rdx, 8(%rdi)
   jne 1b

RR fast paths in untested arm64 assembly:

retain_fast:
   // object in x0
   add x1, x0, #8
   mov x3, #0x200000000
1: ldxr x2, [x1]
   adds x2, x2, x3
   b.mi retain_slow
   stxr w4, x2, [x1]
   cbnz w4, 1b      // store-exclusive failed; retry

release_fast:
   // object in x0
   add x1, x0, #8
   mov x3, #0x200000000
1: ldxr x2, [x1]
   subs x2, x2, x3
   b.mi release_slow
   stlxr w4, x2, [x1]
   cbnz w4, 1b

--
Greg Parker gparker@apple.com Runtime Wrangler


(Joe Groff) #2

This sounds awesome. Should we still consider using a non-pointer isa representation on 64-bit platforms? 16 bytes per object header is still kinda big. If we laid out the np-isa bits so that the "side allocation" bit were the MSB, and the strong refcount immediately below it, we could pull the same trick letting the strong refcount overflow into the side allocation bit.

Drawbacks would be that we'd need to go back to a global side table to reference the overflow allocation, and that on Apple platforms we'd need to follow whatever non-pointer-isa layout the Objective-C runtime uses, and the exact layout of the bits would have to remain runtime-private and thus not inlineable. If we're running on the assumption that side allocations are rarely needed, then paying for the lock in the rare case might be worth a potentially substantial memory savings.

On platforms where we don't need ObjC interop, maybe we could avoid the need for the side table with non-pointer-isa by cloning the heap object's vtable into the side allocation and changing the masked isa pointer to refer to the side allocation. That would make the side allocation larger, and would make operations that extract the type metadata from a class for generics a bit more expensive, but would let vtable lookups remain fast.
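The carry trick Joe describes can be sketched with invented field widths: an 8-bit strong refcount sits directly below the side-allocation MSB, so an increment that overflows the count carries straight into the flag:

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical np-isa layout: bit 63 is the side-allocation flag, bits
// 55-62 are the strong refcount. Widths are for illustration only.
#define SIDE_BIT ((uint64_t)1 << 63)
#define RC_SHIFT 55
#define RC_ONE   ((uint64_t)1 << RC_SHIFT)

static bool needs_slow_path_after_retain(uint64_t *isa_bits) {
    *isa_bits += RC_ONE;               // may carry into SIDE_BIT
    return (*isa_bits & SIDE_BIT) != 0;
}
```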

-Joe

···

On Mar 15, 2016, at 11:59 PM, Greg Parker via swift-dev <swift-dev@swift.org> wrote:



(Brent Royal-Gordon) #3

This is damned clever.

* Allows inexpensive per-object storage for future features like associated references or class extensions with instance variables.

It sounds like you wouldn't be able to resize an out-of-band refcount allocation once it's been created, though, right? There are pointers to it all over memory; you can't track them all unless you save back references to all of them. So if you added, say, a refcount allocation-based feature for extension ivars, every refcount allocation would need the extra pointers to accommodate them in case you dynamically loaded code later that added an ivar in an extension. You couldn't go "whoops, need space for ivars, let's replace this refcount allocation with one that has room for another pointer".

The MSB bit also becomes set if you increment or decrement a retain count too far. That means we can implement the RR fast path with a single conditional branch after the increment or decrement:

You don't talk much about it, but I think you're suggesting that the strong and unowned refcounts would still be packed into the same fields. It's worth noting that your clever high-bit-indicates-slow-path trick would *not* work on the unowned refcount, though that matters less since it doesn't get checked as often.

       // out-of-object refcount (MSB bits 0b10x)
       // or refcount has overflowed (MSB bits 0b111)
       // or refcount is constant (MSB bits 0b110)

Is this going to work on 32-bit? Even if we assume that valid pointers don't start with 0b1, can we assume that they don't start with 0b01?

The side allocation would store a pointer to the object and a strong refcount and a weak refcount. (This weak refcount would be distinct from the unowned refcount.) The weak refcount would be incremented once for every weak variable holding this object.

The advantage of using a side allocation for weak references is that the storage for a weakly-referenced object could be freed synchronously when deinit completes. Only the small side allocation would remain, backing the weak variables until they are cleared on their next access. This is a memory improvement over today's scheme, which keeps the object's entire storage alive for a potentially long time.

The hierarchy:
Strong refcount goes to zero: deinit
Unowned refcount goes to zero: free the object
Weak refcount goes to zero: free the side allocation

I take it the refcount field's pointer is itself considered a weak reference, so the weak refcount starts +1 like the unowned refcount does?


--
Brent Royal-Gordon
Architechies


(John McCall) #4

There are a lot of really good ideas. I would like to see some quasi-specifics about the proposed bit layout of the inline reference count, however. You said that there would still be an inline unowned reference count; is that now tracked independently of the weak reference count? And do we still have an efficient fast path for testing uniquely-reference-or-pinned?

John.

···

On Mar 15, 2016, at 11:59 PM, Greg Parker via swift-dev <swift-dev@swift.org> wrote:
I am considering a new representation for Swift refcounts and other per-object data. This is an outline of the scheme. Comments and suggestions welcome.


(Greg Parker) #5

I wrote a performance mockup of the fast path. It simply checks the MSB in the appropriate places in RefCount.h but does not actually implement any side allocation. I ran it on some RR-heavy benchmarks (QuickSort, InsertionSort, HeapSort, Array2D) on x86_64 and arm64.

arm64 is in fact approximately unchanged. Any difference either way is much less than 1%.

x86_64 is measurably slower:
   1% QuickSort
   2% InsertionSort
   4% Array2D
   5% HeapSort

···

On Mar 15, 2016, at 11:59 PM, Greg Parker via swift-dev <swift-dev@swift.org> wrote:



(Greg Titus) #6

This is damned clever.

Yes, I agree!

The MSB bit also becomes set if you increment or decrement a retain count too far. That means we can implement the RR fast path with a single conditional branch after the increment or decrement:

You don't talk much about it, but I think you're suggesting that the strong and unowned refcounts would still be packed into the same fields. It's worth noting that your clever high-bit-indicates-slow-path trick would *not* work on the unowned refcount, though that matters less since it doesn't get checked as often.

My understanding is that there would not be an unowned refcount here (inline in the object). The first time you make a weak reference to the object it gets converted to have the out-of-band refcount structure, and the strong and unowned refcounts both get stored there (in an undefined-by-this-proposal way).

      // out-of-object refcount (MSB bits 0b10x)
      // or refcount has overflowed (MSB bits 0b111)
      // or refcount is constant (MSB bits 0b110)

Is this going to work on 32-bit? Even if we assume that valid pointers don't start with 0b1, can we assume that they don't start with 0b01?

The out-of-object refcount flag here is only 2 bits (0b10 in the high bits), so presumably you store the other 2 bits of the pointer in the low bits which would always be zero, assuming word-aligned pointers. So in 32-bit you can just left shift by 2 and still have the whole normal pointer length available.

  - Greg

···

On Mar 16, 2016, at 1:42 PM, Brent Royal-Gordon via swift-dev <swift-dev@swift.org> wrote:


(Greg Parker) #7

This is damned clever.

* Allows inexpensive per-object storage for future features like associated references or class extensions with instance variables.

It sounds like you wouldn't be able to resize an out-of-band refcount allocation once it's been created, though, right? There are pointers to it all over memory; you can't track them all unless you save back references to all of them. So if you added, say, a refcount allocation-based feature for extension ivars, every refcount allocation would need the extra pointers to accommodate them in case you dynamically loaded code later that added an ivar in an extension. You couldn't go "whoops, need space for ivars, let's replace this refcount allocation with one that has room for another pointer".

Yes, every side allocation would need to include all of the storage needed for any side allocation feature. That size is sufficiently small, I think; perhaps something like this:

struct {
    void* object;
    uint32_t strongRefCount;
    uint32_t unownedRefCount;
    uint32_t weakRefCount;
    void* associatedReferenceDictionary;
    void* extensionIvarDictionary;
}

The MSB bit also becomes set if you increment or decrement a retain count too far. That means we can implement the RR fast path with a single conditional branch after the increment or decrement:

You don't talk much about it, but I think you're suggesting that the strong and unowned refcounts would still be packed into the same fields. It's worth noting that your clever high-bit-indicates-slow-path trick would *not* work on the unowned refcount, though that matters less since it doesn't get checked as often.

Right. Unowned refcount would be slower than strong refcount. Strong refcount is by far the most common, so we're willing to tilt pretty much any scale in its favor. Unowned refcount would still be in the object, though, so it's still faster than a weak reference.

Alternatively we could reduce the implementation distinction between unowned references and weak references. Only the strong refcount would be in the object. (That in turn would help Joe's dream of a pointer-size object header.) That might make sense if (1) unowned references are sufficiently rare that they don't provoke too many side allocations, and (2) side table weak references are sufficiently fast that they can be used for unowned references too. My fear is that #1 is not true enough.

Anyone want to hack up a custom runtime that counts some object statistics, and run some non-trivial non-benchmark app with it? Something like these:
1. count of objects allocated
2. count of objects that were ever referenced unowned, but never referenced weakly
3. count of objects that were ever referenced weakly

      // out-of-object refcount (MSB bits 0b10x)
      // or refcount has overflowed (MSB bits 0b111)
      // or refcount is constant (MSB bits 0b110)

Is this going to work on 32-bit? Even if we assume that valid pointers don't start with 0b1, can we assume that they don't start with 0b01?

The *low* two bits of the side allocation address are clear (assuming that the side allocation is 4-byte aligned). We can store 0b10 in the high two bits and (side allocation >> 2) in the low 30 bits.

(Bonus: using a shift instead of a mask to recover the pointer is one instruction shorter on armv7, I think.)

(And no, we cannot assume that the MSB of all 32-bit pointers is clear. On 32-bit we can only cheat with the low bits.)
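The 32-bit encoding could be sketched like this, assuming the side allocation is at least 4-byte aligned so its low two bits are zero:

```c
#include <stdint.h>

// Store (address >> 2) in the low 30 bits and the 0b10 tag in the high
// two bits; after the shift, bits 30-31 are zero, so the OR yields 0b10.
#define SIDE_TAG 0x80000000u

static uint32_t encode_side(uint32_t side_addr) {
    return SIDE_TAG | (side_addr >> 2);
}

static uint32_t decode_side(uint32_t bits) {
    return bits << 2;   // one shift recovers the pointer; the tag falls off
}
```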

The side allocation would store a pointer to the object and a strong refcount and a weak refcount. (This weak refcount would be distinct from the unowned refcount.) The weak refcount would be incremented once for every weak variable holding this object.

The advantage of using a side allocation for weak references is that the storage for a weakly-referenced object could be freed synchronously when deinit completes. Only the small side allocation would remain, backing the weak variables until they are cleared on their next access. This is a memory improvement over today's scheme, which keeps the object's entire storage alive for a potentially long time.

The hierarchy:
Strong refcount goes to zero: deinit
Unowned refcount goes to zero: free the object
Weak refcount goes to zero: free the side allocation

I take it the refcount field's pointer is itself considered a weak reference, so the weak refcount starts +1 like the unowned refcount does?

You need to do something to make sure the side allocation is not freed if the object is live with no weak references to it. Biasing the weak refcount might work.

It might be more robust to do something else, though. Ideally an incorrect weak reference decrement would deliberately log an underflow error and halt, instead of quietly freeing the side allocation out from under a live object.

···

On Mar 16, 2016, at 1:42 PM, Brent Royal-Gordon <brent@architechies.com> wrote:

--
Greg Parker gparker@apple.com Runtime Wrangler


(Greg Parker) #8

Packing everything into a single 64-bit field is tricky but might work.

I think x86_64 on OS X and Linux is the worst case currently, with 3 low bits and 17 high bits available. Here's what libobjc stores in its isa field on x86_64:

1 bit is nonpointer
1 bit has associated references
1 bit has destructor
44 bits class pointer
6 bits magic for tools
1 bit is weakly referenced
1 bit is deallocating
1 bit has additional refcount in side allocation
8 bits refcount

Some of those bits can be combined or reduced:

is-nonpointer is unnecessary as long as at least one of the bits outside the class pointer is set. There may be binary compatibility problems here, though. (Do existing Swift apps check that bit directly?)

has-associated-references and is-weakly-referenced and has-additional-refcount could be folded into a single has-sidetable bit.

magic is needed for tools like `leaks` to distinguish real objects from non-object allocations. If the isa field is not a simple pointer then there are otherwise too many false objects for `leaks` to be usable. 6 bits is pushing the bare minimum here, but we might be able to trim this by making the tools smarter and more invasive. (For example: if the has-sidetable bit is set then look for a side allocation pointing back to the object. If you don't find one then it's not a real object.) Being able to run `leaks` and `heap` on an unprepared process is an important capability, and I don't think we can afford to cripple it.

refcount can always be trimmed. I don't know what the shape of the "objects whose refcount is ever greater than X" curve looks like.

Given the above, we can claw back enough bits for things like a small unowned refcount and is-pinned or is-nonstructurally-mutating bits.

My biggest concern is future Swift concurrency support. I suspect we will want storage for a thread ID or a lock. Maybe those would be uncommon enough that they can live in a side allocation. Will we want a thread ID in every thread-local object in production code so we can assert that it has not incorrectly escaped, or can we leave things like that for debug builds only?

One scheme that I want to investigate for ObjC's use is to allocate a flat array of side allocations, and store an index into that array in the isa field's extra bits. Depending on architecture that allows between 256K and 64M objects with side allocations before falling back to a global table. If that works well enough then packing everything into a single pointer-size field becomes more plausible.
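A minimal sketch of the flat-array idea, with invented sizes. A real runtime would also reuse freed slots and fall back to a global table when the array fills up:

```c
#include <stdatomic.h>
#include <stdint.h>

// Store a small index in the spare isa bits instead of a pointer.
#define MAX_SIDE_ENTRIES (1u << 18)      // e.g. 18 spare isa bits = 256K objects

typedef struct SideEntry { void *object; uint32_t refcounts; } SideEntry;

static SideEntry side_table[MAX_SIDE_ENTRIES];
static _Atomic uint32_t next_index = 1;  // index 0 means "no side entry"

// Returns a small index to pack into the isa field, or 0 on exhaustion.
static uint32_t allocate_side_index(void *object) {
    uint32_t idx = atomic_fetch_add(&next_index, 1);
    if (idx >= MAX_SIDE_ENTRIES) return 0;
    side_table[idx].object = object;
    return idx;
}
```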

Joe, did you measure the cost of a mask operation at every Swift virtual call? I can't remember.

···

On Mar 16, 2016, at 9:25 AM, Joe Groff <jgroff@apple.com> wrote:

This sounds awesome. Should we still consider using a non-pointer isa representation on 64 bit platforms? 16 bytes per object header is still kinda big. If we laid out the np-isa bits so that the "side allocation" bit were the MSB, and the strong refcount immediately below it, we could pull the same trick letting the strong refcount overflow into the side allocation bit. Drawbacks would be that we'd need to go back to a global sidetable to reference the overflow allocation, and that on Apple platforms, we'd need to follow whatever non-pointer-isa layout the Objective-C runtime uses, and the exact layout of the bits would have to remain runtime-private and thus not inlineable. If we're running on the assumption that side allocations are rarely needed, then paying for the lock in the rare case might be worth a potentially substantial memory savings.

--
Greg Parker gparker@apple.com Runtime Wrangler


(Greg Parker) #9

I have been imagining something like this:

32-bit object without side allocation:
   (MSB)
       2 bits 00 (ordinary) or 11 (strong refcount is constant)
      20 bits strong refcount
       1 bit is deallocating
       1 bit is pinned
       8 bits unowned refcount and any other flags
   (LSB)

64-bit object without side allocation:
   (MSB)
       2 bits 00 (ordinary) or 11 (strong refcount is constant)
      28 bits strong refcount
       1 bit is deallocating
       1 bit is pinned
      32 bits unowned refcount and any other flags
   (LSB)

Object with side allocation:
   (MSB)
       2 bits 10
   ptr-2 bits side allocation >> 2
   (LSB)

Side allocation:
     ptr bits object
      30 bits strong refcount
       1 bit is deallocating
       1 bit is pinned
      32 bits unowned refcount
      32 bits weak refcount
     ptr bits associated reference table
     ptr bits extension ivars
               ...etc

The uniquely-referenced-or-pinned check can use the same 1-bit rotate trick on 64-bit. On 32-bit it needs an extra shift first to throw away the unowned refcount and other flags. (We might be able to eliminate that shift if there are no other flags, by rearranging the fields and assuming that the unowned refcount will be zero for a uniquely-referenced object.)

On both architectures a constant-refcount object will appear to be multiply referenced. That sounds like the correct behavior for read-only objects (you wouldn't want to modify them in place anyway) but perhaps not the correct behavior for stack-allocated objects.
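Accessors matching the 64-bit inline layout above might look like this (shifts follow the listed field order; whether the stored count is biased is left aside, so here a stored count of 1 means one strong reference):

```c
#include <stdbool.h>
#include <stdint.h>

// 64-bit layout, MSB down: 2 tag bits, 28-bit strong count, deallocating
// bit, pinned bit, then 32 bits of unowned count and flags.
#define RC_ONE    ((uint64_t)1 << 34)   // strong count starts at bit 34
#define RC_PINNED ((uint64_t)1 << 32)

static uint32_t strong_count(uint64_t bits) {
    return (uint32_t)((bits >> 34) & 0x0FFFFFFF);   // 28-bit field
}

// Uniquely-referenced-or-pinned: tag bits must be 00 (ordinary object),
// and either exactly one strong reference or the pinned bit set.
static bool unique_or_pinned(uint64_t bits) {
    return (bits >> 62) == 0 &&
           (strong_count(bits) == 1 || (bits & RC_PINNED) != 0);
}
```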

···

On Mar 16, 2016, at 2:23 PM, John McCall <rjmccall@apple.com> wrote:

On Mar 15, 2016, at 11:59 PM, Greg Parker via swift-dev <swift-dev@swift.org> wrote:
I am considering a new representation for Swift refcounts and other per-object data. This is an outline of the scheme. Comments and suggestions welcome.

There are a lot of really good ideas. I would like to see some quasi-specifics about the proposed bit layout of the inline reference count, however. You said that there would still be an inline unowned reference count; is that now tracked independently of the weak reference count? And do we still have an efficient fast path for testing uniquely-reference-or-pinned?

--
Greg Parker gparker@apple.com Runtime Wrangler


(Arnold Schwaighofer) #10

I think maybe we also want to measure how a cmpxchg vs lock;add solution performs under contention.

Objects in a read-only state (be they cow’ed value types or not) might be shared between threads with the expectation that they are fast, i.e. the argument that “if retain/release are contended, something else is wrong” does not apply.

A load/cmpxchg might send two memory coherence messages (one taking the line shared/exclusive for the load, one taking it modified for the cmpxchg), and mispredicted branches (pipeline flushes) on state changes under contention might exhibit worse performance than a lock;add (one coherence message to M, no mispredicted branch), depending on how things are implemented under the hood. There is always an opportunity for another level of coherence speculation….
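The two control-flow shapes being compared can be sketched abstractly: a CAS retain is a load plus a retry loop with a branch before the store, while a lock;add-style retain is one unconditional add with the special-case check done on the result afterwards. An illustrative single-threaded Python model (not real atomics; RC_ONE and the MSB convention follow the retain pseudocode earlier in the thread):

```python
RC_ONE = 1 << 2          # assumed refcount increment; low bits hold flags
MSB = 1 << 63

class Word:
    """Stand-in for a 64-bit atomic word."""
    def __init__(self, value):
        self.value = value
    def load(self):
        return self.value
    def compare_exchange(self, expected, desired):
        # Models cmpxchg: succeeds only if nobody raced in between.
        if self.value == expected:
            self.value = desired
            return True
        return False
    def fetch_add(self, n):
        # Models lock;xadd: always succeeds, returns the old value.
        old = self.value
        self.value = (self.value + n) & ((1 << 64) - 1)
        return old

def retain_cas(rc):
    while True:                       # retry loop: load + branch + cmpxchg
        old = rc.load()
        new = old + RC_ONE
        if new & MSB:
            return "slow path"        # overflow or out-of-line refcount
        if rc.compare_exchange(old, new):
            return "fast path"

def retain_xadd(rc):
    old = rc.fetch_add(RC_ONE)        # single unconditional add
    if (old + RC_ONE) & MSB:
        return "slow path"            # must undo / hand off after the fact
    return "fast path"
```

The xadd variant pays no branch before the memory operation, but has already mutated the word by the time it discovers a special case, which is why the slow path must be able to compensate after the fact.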

Other benefits we get might outweigh any such cost though.

···

On Mar 16, 2016, at 11:29 PM, Greg Parker via swift-dev <swift-dev@swift.org> wrote:

On Mar 15, 2016, at 11:59 PM, Greg Parker via swift-dev <swift-dev@swift.org> wrote:

I am considering a new representation for Swift refcounts and other per-object data. This is an outline of the scheme. Comments and suggestions welcome.

Today, each object stores 64-bits of refcounts and flags after the isa field.

In this new system, each object would store a pointer-size field after the isa field. This field would have two cases: it could store refcounts and flags, or it could store a pointer to a side allocation that would store refcounts and flags and additional per-object data.

Advantages:
* Saves 4 bytes per object on 32-bit for most objects.
* Improves refcount overflow and underflow detection.
* Might allow an inlineable retain/release fast path in the future.
* Allows a new weak reference implementation that doesn't need to keep entire dead objects alive.
* Allows inexpensive per-object storage for future features like associated references or class extensions with instance variables.

Disadvantages:
* Basic RR operations might be slower on x86_64. This needs to be measured. ARM architectures are probably unchanged.

I wrote a performance mockup of the fast path. It simply checks the MSB in the appropriate places in RefCount.h but does not actually implement any side allocation. I ran it on some RR-heavy benchmarks (QuickSort, InsertionSort, HeapSort, Array2D) on x86_64 and arm64.

arm64 is in fact approximately unchanged. Any difference either way is much less than 1%.

x86_64 is measurably slower:
  1% QuickSort
  2% InsertionSort
  4% Array2D
  5% HeapSort

--
Greg Parker gparker@apple.com Runtime Wrangler

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev


(Brent Royal-Gordon) #11

You don't talk much about it, but I think you're suggesting that the strong and unowned refcounts would still be packed into the same fields. It's worth noting that your clever high-bit-indicates-slow-path trick would *not* work on the unowned refcount, though that matters less since it doesn't get checked as often.

My understanding is that there would not be an unowned refcount here (inline in the object). The first time you make a weak reference to the object it gets converted to have the out-of-band refcount structure, and the strong and unowned refcounts both get stored there (in an undefined-by-this-proposal way).

Note this line in the proposal:

(This weak refcount would be distinct from the unowned refcount.)

My understanding is that there are three refcounts:

* The (new) weak refcount represents claims on the refcount allocation. It (perhaps) starts +1 because the object's refcount field is a claim on the refcount allocation. When it goes to 0, the refcount allocation is freed.
* The unowned refcount represents claims on the object's allocation. It starts +1 because the strong refcount is a claim on the object's allocation. When it goes to 0, the refcount allocation's backreference is nulled (if there is one), the weak refcount is decremented (if it exists), and the object is freed.
* The strong refcount represents ownership of the object. It starts +1 because the initializer owns the object. When it goes to 0, the object is marked as deinitializing, `deinit`s are run, and the unowned refcount is decremented.

(The backreference might be cut when the strong refcount goes to zero; it doesn't make much difference either way.)
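That lifecycle can be written out as a small executable model. The +1 biases and the teardown ordering are exactly as listed above; the class and method names are invented for the sketch:

```python
class SideAllocation:
    def __init__(self, obj):
        self.object = obj        # backreference to the object
        self.weak = 1            # +1: the object's refcount field points here
        self.freed = False

class Object:
    def __init__(self):
        self.strong = 1          # +1: the initializer owns the object
        self.unowned = 1         # +1: the strong refcount is a claim on the allocation
        self.side = None
        self.deinited = False
        self.freed = False

    def release(self):
        self.strong -= 1
        if self.strong == 0:
            self.deinited = True     # run deinits...
            self.unowned_release()   # ...then drop the strong count's claim

    def unowned_release(self):
        self.unowned -= 1
        if self.unowned == 0:
            if self.side is not None:
                self.side.object = None   # null the backreference
                weak_release(self.side)   # drop the field's claim on the side allocation
            self.freed = True             # free the object's allocation

def weak_release(side):
    side.weak -= 1
    if side.weak == 0:
        side.freed = True                 # free the refcount allocation
```

A live weak reference keeps only the side allocation alive after the object dies, which is the whole point: dead objects no longer pin their full allocation.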

It sounds to me like this proposal is saying:

* There is a pointer-sized refcount field in each object.
* If the high bit is clear, ~half of its bytes are used for the strong refcount and the rest are used for the unowned refcount, with a few bits somewhere used for various flags.
* If the high bit is set, the remaining bits are a pointer to an allocation containing strong, unowned, and weak refcounts, plus any other stuff we might add later.
* We switch from inline refcount to pointer-to-refcount when one of the refcounts overflows, a weak reference is formed, or one of the future features we might add later is used. We never switch back from pointer-to-refcount to inline refcount.

···

--
Brent Royal-Gordon
Architechies


(Brent Royal-Gordon) #12

I take it the refcount field's pointer is itself considered a weak reference, so the weak refcount starts +1 like the unowned refcount does?

You need to do something to make sure the side allocation is not freed if the object is live with no weak references to it. Biasing the weak refcount might work.

It might be more robust to do something else, though. Ideally an incorrect weak reference decrement would deliberately log an underflow error and halt, instead of quietly freeing the side allocation out from under a live object.

The backreference from the side allocation could be used to indicate the object's liveness. Require the object to null that pointer before it performs its weak release; then if the weak refcount goes to zero before the backreference is null, you've had an unbalanced release.
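A sketch of that protocol (Python, illustrative): the object nulls the backreference before performing its own weak release, so a weak count hitting zero while the backreference is still non-null can be diagnosed as an unbalanced release rather than silently freeing the side allocation:

```python
class SideAllocation:
    def __init__(self, obj):
        self.object = obj    # backreference; non-null while the object is live
        self.weak = 1        # +1 bias: the object's refcount field is a claim
        self.freed = False

def weak_release(side):
    side.weak -= 1
    if side.weak == 0:
        if side.object is not None:
            # The object is still live but its side allocation is about to be
            # freed out from under it: halt loudly instead of corrupting memory.
            raise RuntimeError("weak refcount underflow: unbalanced weak release")
        side.freed = True    # legitimate: last claim gone, object already dead

def object_died(side):
    side.object = None       # null the backreference first...
    weak_release(side)       # ...then drop the object's own claim
```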

···

--
Brent Royal-Gordon
Architechies


(John Calsbeek) #13

One scheme that I want to investigate for ObjC's use is to allocate a flat array of side allocations, and store an index into that array in the isa field's extra bits. Depending on architecture that allows between 256K and 64M objects with side allocations before falling back to a global table. If that works well enough then packing everything into a single pointer-size field becomes more plausible.

Joe, did you measure the cost of a mask operation at every Swift virtual call? I can't remember.

This is a totally harebrained idea, but: if a mask at every call site is acceptable and you admit restrictions about where in the virtual address space vtables can be allocated, *and* you do the flat array of side allocations, you can get back a lot of bits.

Example (for 64-bit):

  (MSB)
     1 bit is deallocating
     1 bit is pinned
    18 bits side allocation index
    20 bits vtable pointer high
     2 bits 00 (ordinary) or 11 (strong refcount is constant)
    10 bits strong refcount
     4 bits vtable pointer low
     8 bits unowned refcount
  (LSB)

Mask this with 0x00000fffff000f00 and you get a pointer with 256-byte alignment that can range over a 44-bit address space with a stride of 16 MB—hopefully enough flexibility to find some unused virtual address space. Unfortunately with this layout you can only fit 16 vtables in one 4k page, so it’s not particularly TLB-friendly.
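The mask arithmetic can be checked mechanically. In the layout above the vtable bits sit at positions 8–11 and 24–43, which is exactly what 0x00000fffff000f00 selects. A Python sketch (all packed values are made up for the example):

```python
VTABLE_MASK = 0x00000fffff000f00   # bits 8-11 (low) and 24-43 (high)

def pack(deallocating, pinned, side_index, vtable, tag, strong_rc, unowned):
    """Assemble the 64-bit field from the layout above (LSB to MSB:
    8-bit unowned, 4-bit vtable low, 10-bit strong RC, 2-bit tag,
    20-bit vtable high, 18-bit side index, pinned, deallocating)."""
    # The vtable pointer must have zeros everywhere outside the maskable bits.
    assert (vtable & ~VTABLE_MASK) == 0, "vtable must fit the maskable bits"
    return (deallocating << 63) | (pinned << 62) | (side_index << 44) \
         | (tag << 22) | (strong_rc << 12) | unowned | vtable

# A made-up vtable address: 256-byte aligned, zero in bits 12-23, below 2^44.
vtable = 0x00000ab3d4000300
word = pack(0, 1, 0x2F00F, vtable, 0b00, 37, 2)
assert (word & VTABLE_MASK) == vtable      # one AND recovers the pointer
assert (word & VTABLE_MASK) % 256 == 0     # 256-byte alignment, as claimed
```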

I can think of several dozen ways this might fall over, plus it’s starting to be a pretty complicated layout to inline and make ABI-stable, but it’s an idea.

My biggest concern is future Swift concurrency support. I suspect we will want storage for a thread ID or a lock. Maybe those would be uncommon enough that they can live in a side allocation. Will we want a thread ID in every thread-local object in production code so we can assert that it has not incorrectly escaped, or can we leave things like that for debug builds only?

I don’t know anything about the possible future concurrency use cases, but I certainly hope we won’t need a lock in every object in the fast path.


(Joe Groff) #14

Yeah. The effect of the mask operation was unmeasurable on the MacBook Pro and iPhone 5s I was testing on. Masking using a resilient mask value loaded from a dylib did have a noticeable impact on an Apple Watch, though.

-Joe

···

On Mar 16, 2016, at 4:08 PM, Greg Parker <gparker@apple.com> wrote:

On Mar 16, 2016, at 9:25 AM, Joe Groff <jgroff@apple.com> wrote:

This sounds awesome. Should we still consider using a non-pointer isa representation on 64 bit platforms? 16 bytes per object header is still kinda big. If we laid out the np-isa bits so that the "side allocation" bit were the MSB, and the strong refcount immediately below it, we could pull the same trick letting the strong refcount overflow into the side allocation bit. Drawbacks would be that we'd need to go back to a global sidetable to reference the overflow allocation, and that on Apple platforms, we'd need to follow whatever non-pointer-isa layout the Objective-C runtime uses, and the exact layout of the bits would have to remain runtime-private and thus not inlineable. If we're running on the assumption that side allocations are rarely needed, then paying for the lock in the rare case might be worth a potentially substantial memory savings.

Packing everything into a single 64-bit field is tricky but might work.

I think OS X and Linux x86_64 are the worst case currently, with 3 low bits and 17 high bits available. Here's what libobjc stores in its isa field on x86_64:

   1 bit is nonpointer
   1 bit has associated references
   1 bit has destructor
  44 bits class pointer
   6 bits magic for tools
   1 bit is weakly referenced
   1 bit is deallocating
   1 bit has additional refcount in side allocation
   8 bits refcount
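Reading the list from LSB to MSB, the class pointer occupies 44 bits starting at bit 3, so it can be recovered with a single AND. An illustrative Python decode of that layout (the field offsets follow the list above; the example isa value is invented):

```python
ISA_MASK = 0x00007ffffffffff8   # bits 3-46: the class pointer field

def decode_isa(isa):
    """Decode the x86_64 non-pointer isa layout listed above."""
    return {
        "nonpointer":    isa & 1,
        "has_assoc":     (isa >> 1) & 1,
        "has_dtor":      (isa >> 2) & 1,
        # Classes are 8-byte aligned and live below 2^47, so masking the
        # 44-bit field in place yields the class pointer directly.
        "class_ptr":     isa & ISA_MASK,
        "magic":         (isa >> 47) & 0x3F,
        "weakly_ref":    (isa >> 53) & 1,
        "deallocating":  (isa >> 54) & 1,
        "has_sidetable": (isa >> 55) & 1,
        "extra_rc":      (isa >> 56) & 0xFF,
    }
```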

Some of those bits can be combined or reduced:

is-nonpointer is unnecessary as long as at least one of the bits outside the class pointer is set. There may be binary compatibility problems here, though. (Do existing Swift apps check that bit directly?)

has-associated-references and is-weakly-referenced and has-additional-refcount could be folded into a single has-sidetable bit.

magic is needed for tools like `leaks` to distinguish real objects from non-object allocations. If the isa field is not a simple pointer then there are otherwise too many false objects for `leaks` to be usable. 6 bits is pushing the bare minimum here, but we might be able to trim this by making the tools smarter and more invasive. (For example: if the has-sidetable bit is set then look for a side allocation pointing back to the object. If you don't find one then it's not a real object.) Being able to run `leaks` and `heap` on an unprepared process is an important capability, and I don't think we can afford to cripple it.

refcount can always be trimmed. I don't know what the shape of the "objects whose refcount is ever greater than X" curve looks like.

Given the above, we can claw back enough bits for things like a small unowned refcount and is-pinned or is-nonstructurally-mutating bits.

My biggest concern is future Swift concurrency support. I suspect we will want storage for a thread ID or a lock. Maybe those would be uncommon enough that they can live in a side allocation. Will we want a thread ID in every thread-local object in production code so we can assert that it has not incorrectly escaped, or can we leave things like that for debug builds only?

One scheme that I want to investigate for ObjC's use is to allocate a flat array of side allocations, and store an index into that array in the isa field's extra bits. Depending on architecture that allows between 256K and 64M objects with side allocations before falling back to a global table. If that works well enough then packing everything into a single pointer-size field becomes more plausible.

Joe, did you measure the cost of a mask operation at every Swift virtual call? I can't remember.