Resilient dynamic dispatch ABI. Notes and mini-proposal.


(Andrew Trick) #1

I'm following up on a resilient dynamic dispatch discussion kicked off
by Slava during a performance team meeting, to summarize some key
points on the public [swift-dev] list.

It's easy to get sidetracked by the details of dynamic
dispatch and various ways to generate code. I suggest approaching the
problem by focusing on the ABI aspects and flexibility the ABI affords
for future optimization. I'm including a proposal for one specific
approach (#3) that wasn't discussed yet.

···

---
#1. (thunk export) The simplest, most flexible way to expose dispatch
across resilience boundaries is by exporting a single per-method entry
point. Future compilers could improve dispatch and gradually expose
more ABI details.

Cost: We're forced to export all those symbols in perpetuity.

[The cost of the symbols is questionable. The symbol trie should compress the
names, so the size may be small, and they should be lazily resolved,
so the startup cost should be amortized].

---
#2. (offset export) An alternative approach was proposed by JoeG a
while ago and revisited in the meeting yesterday. It involves a
client-side vtable offset lookup helper.

This allows more opportunity for micro-optimization on the client
side. It exposes the isa-based vtable mechanism as ABI, but it
stops short of exposing the vtable layout itself. Guaranteeing vtable
dispatch may become a problem in the future because it forces an
explosion of metadata. It also has the same problem as #1 because the
framework must export a per-method symbol for the dispatch
offset. What's worse, the symbols need to be eagerly resolved (AFAIK).

---
#3. (method index) This is an alternative that I've alluded to before
but that was not discussed in yesterday's meeting, one that makes a
tradeoff between exporting symbols and exposing vtable layout. I want
to focus on the direct cost of the ABI support and the flexibility of
this approach vs. approach #1, without arguing over how to
micro-optimize various dispatching schemes. Here's how it works:

The ABI specifies a sort function for public methods that gives each
one a per-class index. Version availability takes sort precedence, so
public methods can be added without affecting other
indices. [Apparently this is the same approach we're taking with
witness tables].

As with #2, this avoids locking down the vtable format for now--in the
future we'll likely optimize it further. To avoid locking all methods
into the vtable mechanism, the offset can be tagged. The alternative
dispatch mechanism for tagged offsets will be hidden within the
class-defining framework.

This avoids the potential explosion of exported symbols--it's limited
to one per public class. It avoids explosion of metadata by allowing
alternative dispatch for some subset of methods. These tradeoffs can
be explored in the future, independent of the ABI.

---
#3a. (offset table export) A single per-class entry point provides a
pointer to an offset table. [It can be optionally cached on the client
side].

  method_index = immediate
  { // common per-class method lookup
    isa = load[obj]
    isa = isa & @isa_mask
    offset = load[@class_method_table + method_index]
    if (isVtableOffset(offset))
      method_entry = load[isa + offset]
    else
      method_entry = @resolveMethodAddress(isa, @class_method_table, method_index)
  }
  call method_entry

Cost - client code size: Worst case 3 instructions to dispatch vs 1
instruction for approach #1. Method lookups can be combined, so groups
of calls will be more compact.

Cost - library size: the offset tables themselves need to be
materialized on the framework side. I believe this can be done
statically in read-only memory, but that needs to be verified.

ABI: The offset table format and tag bit are baked into the ABI.

---
#3b. (lazy resolution) Offset tables can be completely localized.

  method_index = immediate
  { // common per-class method lookup
    isa = load[obj]
    offset = load[@local_class_method_table + method_index]
    if (!isInitializedOffset(offset)) {
      offset = @resolveMethodOffset(@class_id, method_index)
      store offset, [@local_class_method_table + method_index]
    }
    if (isVtableOffset(offset))
      method_entry = load[isa + offset]
    else
      method_entry = @resolveMethodAddress(isa, @class_id, method_index)
  }
  call method_entry

ABI: This avoids exposing the offset table format as ABI. All that's
needed is a symbol for the class, a single entry point for method
offset resolution, and a single entry point for non-vtable method
resolution.

Benefit: The library no longer needs to statically materialize
tables. Instead, they are initialized lazily in each client module.

Cost: Lazy initialization of local tables requires an extra check and
burns some code size.

---
Caveat:

This is the first time I've thought through approach #3, and it hasn't
been discussed, so there are likely a few things I'm missing at the
moment.

---
Side Note:

Regardless of the resilient dispatch mechanism, dispatch within a
module should be implemented with thunks to avoid type-checking
classes from other files and to improve compile time in non-WMO
builds, as Slava requested.

-Andy


(John McCall) #2

I'm following up on a resilient dynamic dispatch discussion kicked off by
Slava during a performance team meeting to summarize some key
points on public [swift-dev].

It's easy to get sidetracked by the details of dynamic
dispatch and various ways to generate code. I suggest approaching the
problem by focusing on the ABI aspects and flexibility the ABI affords
for future optimization. I'm including a proposal for one specific
approach (#3) that wasn't discussed yet.

---
#1. (thunk export) The simplest, most flexible way to expose dispatch
across resilience boundaries is by exporting a single per-method entry
point. Future compilers could improve dispatch and gradually expose
more ABI details.

Probably the most important such case is that many of these "dispatch" symbols
for a non-open class could simply be aliases to the actual method definition.

Note that open classes have to support super-dispatch, not just ordinary dynamic
dispatch; that's something that needs to be discussed at the same time, since it
may affect the trade-offs. #1 has some serious disadvantages here; the others
are likely fine, since their mechanisms are all easily parameterized by the isa.

Cost: We're forced to export all those symbols in perpetuity.

[The cost of the symbols is questionable. The symbol trie should compress the
names, so the size may be small, and they should be lazily resolved,
so the startup cost should be amortized].

---
#2. (offset export) An alternative approach was proposed by JoeG a
while ago and revisited in the meeting yesterday. It involves a
client-side vtable offset lookup helper.

This allows more opportunity for micro-optimization on the client
side. This exposes the isa-based vtable mechanism as ABI. However, it
stops short of exposing the vtable layout itself. Guaranteeing vtable
dispatch may become a problem in the future because it forces an
explosion of metadata. It also has the same problem as #1 because the
framework must export a per-method symbol for the dispatch
offset. What's worse, the symbols need to be eagerly resolved (AFAIK).

---
#3. (method index) This is an alternative that I've alluded to before,
but was not discussed in yesterday's meeting. One that makes a
tradeoff between exporting symbols vs. exposing vtable layout. I want
to focus on direct cost of the ABI support and flexibility of this
approach vs. approach #1 without arguing over how to micro-optimize
various dispatching schemes. Here's how it works:

The ABI specifies a sort function for public methods that gives each
one a per-class index. Version availability takes sort precedence, so
public methods can be added without affecting other
indices. [Apparently this is the same approach we're taking with
witness tables].

As with #2 this avoids locking down the vtable format for now--in the
future we'll likely optimize it further. To avoid locking all methods
into the vtable mechanism, the offset can be tagged. The alternative
dispatch mechanism for tagged offsets will be hidden within the
class-defining framework.

As a note, I went down a garden path here — I didn't realize at first that #3
wasn't an option on its own, but really just front matter for options #3a and #3b.
The paragraph about the sort, especially coming after #2, made me think that
you were talking about the option — let's call it #2b — where the class only
exports an offset to the start of its v-table and an availability-sort allows the
caller to hard-code a fixed offset to add to that. This greatly cuts down on the
number of symbols required to be exported, but of course it does force all
the methods to actually end up in the v-table.

This avoids the potential explosion of exported symbols--it's limited
to one per public class. It avoids explosion of metadata by allowing
alternative dispatch for some subset of methods. These tradeoffs can
be explored in the future, independent of the ABI.

---
#3a. (offset table export) A single per-class entry point provides a
pointer to an offset table. [It can be optionally cached on the client
side].

Why would this need to be a function? Just to allow its resolution to be lazy?
It seems to me that the class definition always knows the size of this table.

method_index = immediate
{ // common per-class method lookup
   isa = load[obj]
   isa = isa & @isa_mask
   offset = load[@class_method_table + method_index]
   if (isVtableOffset(offset))
     method_entry = load[isa + offset]
   else
     method_entry = @resolveMethodAddress(isa, @class_method_table, method_index)
}
call method_entry

I hope the proposal is that the stuff in braces is a locally-emitted function,
not something inlined into every call site.

A sort of #4 is that this function could be exported by the class definition.
It would be passed an isa and a method index and could resolve it however it likes.

Cost - client code size: Worst case 3 instructions to dispatch vs 1
instruction for approach #1. Method lookups can be combined, so groups
of calls will be more compact.

Cost - library size: the offset tables themselves need to be
materialized on the framework side. I believe this can be done
statically in read-only memory, but that needs to be verified.

ABI: The offset table format and tag bit are baked into the ABI.

---
#3b. (lazy resolution) Offset tables can be completely localized.

method_index = immediate
{ // common per-class method lookup
   isa = load[obj]
   offset = load[@local_class_method_table + method_index]
   if (!isInitializedOffset(offset)) {
     offset = @resolveMethodOffset(@class_id, method_index)
     store offset, [@local_class_method_table + method_index]
   }
   if (isVtableOffset(offset))
     method_entry = load[isa + offset]
   else
     method_entry = @resolveMethodAddress(isa, @class_id, method_index)
}
call method_entry

The size of @local_class_method_table is not statically knowable.
Fortunately, it doesn't matter, because this mechanism does not actually
care about the table being contiguous; the lookup function could be
passed a per-method cache variable. This would also allow the lookup
function to be shared between classes.

John.

···

On Feb 2, 2017, at 9:57 PM, Andrew Trick via swift-dev <swift-dev@swift.org> wrote:



(Joe Groff) #3

Given that most open-coded resilient method lookup paths require an extra load dependency to grab the method offset before loading the method address itself, we might possibly consider indirecting the vtables for each class, so that the top-level vtable contains [address of root class vtable, address of first child class vtable, etc.].

If a class hierarchy is fixed once exported (in other words, you can't insert a superclass into an inheritance chain without an ABI break), then the offset of each superclass's pointer in the top-level vtable would be statically known, and the offset into each second-level vtable could be statically known via sorting by availability. This somewhat matches how we lay out protocol witness tables, where each parent protocol's witness table is indirectly referenced instead of being inlined into the leaf witness table.

(OTOH, method offsets can be cached and reused to dispatch the same method on different objects, whereas we would have to perform the load chain once per object per method with this approach.)

-Joe



(Karl) #4

I have a question about current dispatching behaviour with protocols and ‘Self’.

protocol CustomEquatable {
    func equal(to: Self) -> Bool
}

open class Super : CustomEquatable {
    func equal(to: Super) -> Bool { print("super"); return false }
}
class Sub: Super {
    func equal(to: Sub) -> Bool { print("sub-sub"); return true }
    override func equal(to: Super) -> Bool { print("sub-super"); return true }
}

Sub().equal(to: Sub()) // sub-sub
Sub().equal(to: Super()) // sub-super
Super().equal(to: Sub()) // super
Super().equal(to: Super()) // super

(Sub() as Super).equal(to: Sub()) // sub-super — dynamically dispatched to callee’s type, not param
(Sub() as Super).equal(to: (Sub() as Super)) // sub-super — as above

Currently, we dynamically dispatch to the callee’s type to find ‘Self’, but we don’t apply that consistently when dispatching to ‘Self’-type parameters. Is that expected behaviour?

- Karl



(Andrew Trick) #5

---
#1. (thunk export) The simplest, most flexible way to expose dispatch
across resilience boundaries is by exporting a single per-method entry
point. Future compilers could improve dispatch and gradually expose
more ABI details.

Probably the most important such case is that many of these "dispatch" symbols
for a non-open class could simply be aliases to the actual method definition.

Note that open classes have to support super-dispatch, not just ordinary dynamic
dispatch; that's something that needs to be discussed at the same time, since it
may affect the trade-offs. #1 has some serious disadvantages here; the others
are likely fine, since their mechanisms are all easily parameterized by the isa.

---
#3. (method index) This is an alternative that I've alluded to before,
but was not discussed in yesterday's meeting. One that makes a
tradeoff between exporting symbols vs. exposing vtable layout. I want
to focus on direct cost of the ABI support and flexibility of this
approach vs. approach #1 without arguing over how to micro-optimize
various dispatching schemes. Here's how it works:

The ABI specifies a sort function for public methods that gives each
one a per-class index. Version availability takes sort precedence, so
public methods can be added without affecting other
indices. [Apparently this is the same approach we're taking with
witness tables].

As with #2 this avoids locking down the vtable format for now--in the
future we'll likely optimize it further. To avoid locking all methods
into the vtable mechanism, the offset can be tagged. The alternative
dispatch mechanism for tagged offsets will be hidden within the
class-defining framework.

As a note, I went down a garden path here — I didn't realize at first that #3
wasn't an option on its own, but really just forematter for options #3a and #3b.
The paragraph about the sort, especially coming after #2, made me think that
you were talking about the option — let's call it #2b — where the class only
exports an offset to the start of its v-table and an availability-sort allows the
caller to hard-code a fixed offset to add to that. This greatly cuts down on the
number of symbols required to be exported, but of course it does force all
the methods to actually end up in the v-table.

Right, I didn't go down that path initially because, as Joe pointed
out, the vtable offsets aren't statically knowable because of the
resilient base class problem. But, of course the vtable base offset
could be exported and we can rely on the class being realized before
it's invoked. This is great if we're willing to commit to vtable
dispatch.

This avoids the potential explosion of exported symbols--it's limited
to one per public class. It avoids explosion of metadata by allowing
alternative dispatch for some subset of methods. These tradeoffs can
be explored in the future, independent of the ABI.

---
#3a. (offset table export) A single per-class entry point provides a
pointer to an offset table. [It can be optionally cached on the client
side].

Why would this need to be a function? Just to allow its resolution to be lazy?
It seems to me that the class definition always knows the size of this table.

I don't know what you're asking. In #3b, the offset is resolved by a
class-exported function.

method_index = immediate
{ // common per-class method lookup
  isa = load[obj]
  isa = isa & @isa_mask
  offset = load[@class_method_table + method_index]
  if (isVtableOffset(offset))
    method_entry = load[isa + offset]
  else
    method_entry = @resolveMethodAddress(isa, @class_method_table, method_index)
}
call method_entry

I hope the proposal is that the stuff in braces is a locally-emitted function,
not something inlined into every call site.

This approach is based on generating local per-class "nocallersave" helper functions
for method lookup (all the stuff in curly braces).

A sort of #4 is that this function could be exported by the class definition.
It would be passed an isa and a method index and could resolve it however it likes.

Yeah, let's call that option #4. I like that from an ABI perspective, but didn't go there
because it seems strictly worse than #1 in terms of runtime cost.

-Andy

···

On Feb 3, 2017, at 10:58 AM, John McCall <rjmccall@apple.com> wrote:



(Andrew Trick) #6

Hmm... I thought the local method offset table size could be statically knowable because it will only be accessed for methods that were publicly available at build time, based on the sorted method index.

We could simplify the method resolution API with a single exported symbol per-class (maybe that's what you're getting at):

method_entry = resolveMethodAddress_ForAClass(isa, method_index, &vtable_offset)

The problem with that is the client-side code can’t hoist and combine the method offset lookup anymore.

-Andy

···

On Feb 3, 2017, at 11:55 AM, Andrew Trick via swift-dev <swift-dev@swift.org> wrote:



(John McCall) #7


The size of @local_class_method_table is not statically knowable.
Fortunately, it doesn't matter, because this mechanism does not actually
care about the table being contiguous; the lookup function could be
passed a per-method cache variable. This would also allow the lookup
function to be shared between classes.

Hmm... I thought the local method offset table size could be statically knowable because it will only be accessed for methods that were publicly available at build time, based on the sorted method index.

Sure, I mean, you can pick a table size that's sufficient for all the offsets that will be passed. That would save a few instructions at call sites, at the cost of requiring the resolvers to be per-class.

We could simplify the method resolution API with a single exported symbol per-class (maybe that's what you're getting at):

method_entry = resolveMethodAddress_ForAClass(isa, method_index, &vtable_offset)

Right, this is what I was thinking.

The problem with that is the client-side code can’t hoist and combine the method offset lookup anymore.

Why not? vtable_offset is basically an opaque cache here. Sure, technically the call isn't readnone anymore, but it's an innocuous violation like adding memoization to sin().

Or is there some finer-grained hoisting you're imagining than the entire call?

John.

···

On Feb 3, 2017, at 4:18 PM, Andrew Trick <atrick@apple.com> wrote:

On Feb 3, 2017, at 11:55 AM, Andrew Trick via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:


(John McCall) #8

---
#1. (thunk export) The simplest, most flexible way to expose dispatch
across resilience boundaries is by exporting a single per-method entry
point. Future compilers could improve dispatch and gradually expose
more ABI details.

Probably the most important such case is that many of these "dispatch" symbols
for a non-open class could simply be aliases to the actual method definition.

Note that open classes have to support super-dispatch, not just ordinary dynamic
dispatch; that's something that needs to be discussed at the same time, since it
may affect the trade-offs. #1 has some serious disadvantages here; the others
are likely fine, since their mechanisms are all easily parameterized by the isa.

---
#3. (method index) This is an alternative that I've alluded to before,
but was not discussed in yesterday's meeting. One that makes a
tradeoff between exporting symbols vs. exposing vtable layout. I want
to focus on direct cost of the ABI support and flexibility of this
approach vs. approach #1 without arguing over how to micro-optimize
various dispatching schemes. Here's how it works:

The ABI specifies a sort function for public methods that gives each
one a per-class index. Version availability takes sort precedence, so
public methods can be added without affecting other
indices. [Apparently this is the same approach we're taking with
witness tables].

As with #2 this avoids locking down the vtable format for now--in the
future we'll likely optimize it further. To avoid locking all methods
into the vtable mechanism, the offset can be tagged. The alternative
dispatch mechanism for tagged offsets will be hidden within the
class-defining framework.

As a note, I went down a garden path here — I didn't realize at first that #3
wasn't an option on its own, but really just front matter for options #3a and #3b.
The paragraph about the sort, especially coming after #2, made me think that
you were talking about the option — let's call it #2b — where the class only
exports an offset to the start of its v-table and an availability-sort allows the
caller to hard-code a fixed offset to add to that. This greatly cuts down on the
number of symbols required to be exported, but of course it does force all
the methods to actually end up in the v-table.

Right, I didn't go down that path initially because, as Joe pointed
out, the vtable offsets aren't statically knowable because of the
resilient base class problem. But, of course the vtable base offset
could be exported and we can rely on the class being realized before
it's invoked. This is great if we're willing to commit to vtable
dispatch.

This avoids the potential explosion of exported symbols--it's limited
to one per public class. It avoids explosion of metadata by allowing
alternative dispatch for some subset of methods. These tradeoffs can
be explored in the future, independent of the ABI.

---
#3a. (offset table export) A single per-class entry point provides a
pointer to an offset table. [It can be optionally cached on the client
side].

Why would this need to be a function? Just to allow its resolution to be lazy?
It seems to me that the class definition always knows the size of this table.

I don't know what you're asking. In #3b, the offset is resolved by a
class-exported function.

You said "A single per-class entry point provides a pointer to an offset table",
which suggests that, well, there's a function that provides (i.e. returns) a pointer
to an offset table. I'm asking why accessing the offset table requires a function
call instead of just accessing a global variable. If you're worried about initializing it,
I think it's clearly better to rely on it having been initialized as part of initializing the
class metadata, which means that (optimization aside) call sites can always just
assume it's been initialized by the point of call.

Now, that does create a more complex optimization problem because the access
isn't arbitrarily hoistable, the same way that hoisting an ivar offset load isn't arbitrarily
hoistable, but, well, as that example shows, this is an optimization problem we
already have and need to get better at, and it's not a total blocker.

method_index = immediate
{ // common per-class method lookup
  isa = load[obj]
  isa = isa & @isa_mask
  offset = load[@class_method_table + method_index]
  if (isVtableOffset(offset))
    method_entry = load[isa + offset]
  else
    method_entry = @resolveMethodAddress(isa, @class_method_table, method_index)
}
call method_entry

I hope the proposal is that the stuff in braces is a locally-emitted function,
not something inlined into every call site.

This approach is based on generating local per-class "nocallersave" helper functions
for method lookup (all the stuff in curly braces).

Okay, good.

A sort of #4 is that this function could be exported by the class definition.
It would be passed an isa and a method index and could resolve it however it likes.

Yeah, let's call that option #4. I like that from an ABI perspective, but didn't go there
because it seems strictly worse than #1 in terms of runtime cost.

Well, it doesn't require per-method symbols, but yes, it definitely gives up some static
optimizations relative to #1. Most of these do.

I think we can generalize this discussion a bit by describing some mostly-independent axes of variation:

I. A call site:
  I1) inlines the lookup
  I2) calls a function that performs the lookup and returns a function pointer
  I3) calls a function that performs the lookup and calls the function
I3 minimizes code size at the call site, but it prevents the lookup (or components thereof) from being hoisted. I1 requires at least some details of the lookup to be ABI.

II. The function that performs the lookup:
II1) is emitted lazily
II2) is emitted eagerly
II2 can use static information to implement the lookup inline without affecting the ABI. II1 cannot; it must either rely on ABI assumptions or just recursively use another strategy, in which case the function is probably just a call-site code-size optimization.

III. The function that performs the lookup:
  III1) is method-specific
  III2) is class-specific
  III3) is part of the language runtime
III1 requires a lot more functions, but if emitted with the class definition, it allows the lookup to be statically specialized based on the method called, so e.g. if a non-open method is never overridden, it can be resolved to a specific function pointer immediately, possibly by just aliasing a symbol to the function definition. III3 minimizes the number of functions required but doesn't allow us to evolve dispatch beyond what's supported by the current language runtime and doesn't allow class-specific optimizations (e.g. if the class definition statically knows the offset to its v-table).

IV. The function that performs the lookup:
  IV1) is parameterized by an isa
  IV2) is not parameterized by an isa
IV1 allows the same function to be used for super-dispatch but requires extra work to be inlined at the call site (possibly requiring a chain of resolution function calls).

V. For any particular function or piece of information, it can be accessed:
  V1) directly through a symbol
  V2) through a class-specific table
  V3) through a hierarchy-specific table (e.g. the class object)
V1 requires more global symbols, especially if the symbol is per-method, but doesn't have any index-computation problems, and it's generally a bit more efficient.
V2 allows stable assignment of fixed indexes to entries because of availability-sorting.
V3 does not; it requires some ability to (at least) slide indexes of entries because of changes elsewhere in the hierarchy.
If there are multiple instantiations of a table (perhaps with different information, like a v-table), V2 and V3 can be subject to table bloat.

So I think your alternatives were:
1. I3, II2, III1, IV2, V1 (for the dispatch function): a direct call to a per-method global function that performs the dispatch. We could apply V2 to this to decrease the number of global symbols required, but at the cost of inflating the call site and requiring a global variable whose address would have to be resolved at load time. Has an open question about super dispatch.
2. I1, V3 (for the v-table), V1 (for the global offset): a load of a per-method global variable giving an offset into the v-table. Joe's suggestion adds a helper function as a code-size optimization that follows I2, II1, III1, IV2. Again, we could also use V2 for the global offset to reduce the symbol-table costs.
3. I2, II2, III2, IV1, V2 (for the class offset / dispatch mechanism table). At least I think this is right? The difference between 3a and 3b seems to be about initialization, but maybe also shifts a lot of code-generation to the call site?

John.

···

On Feb 3, 2017, at 2:55 PM, Andrew Trick <atrick@apple.com> wrote:

On Feb 3, 2017, at 10:58 AM, John McCall <rjmccall@apple.com> wrote:

On Feb 2, 2017, at 9:57 PM, Andrew Trick via swift-dev <swift-dev@swift.org> wrote:


(John McCall) #9

Given that most open-coded resilient method lookup paths require an extra load dependency to grab the method offset before loading the method address itself, we might possibly consider indirecting the vtables for each class, so that the top-level vtable contains [address of root class vtable, address of first child class vtable, etc.]. If a class hierarchy is fixed once exported (in other words, you can't insert a superclass into an inheritance chain without an ABI break), then the offset into each superclass's pointer in the vtable would be statically known, and the offset into each second-level vtable could be statically known via sorting by availability. This somewhat matches how we lay out protocol witness tables, where each parent protocol's witness table is indirectly referenced instead of being inlined into the leaf witness table. (OTOH, method offsets can be cached and reused to dispatch the same method on different objects, whereas we would have to perform the load chain once per object per method with this approach.)

Great point.

I'm still uncomfortable with the idea of assuming that we can't insert a superclass into an inheritance chain. This isn't an assumption that's otherwise necessary or even useful, unless we decide to start optimizing dynamic casts.

Assuming it's valid, some additional trade-offs that come to mind:
  - It adds a load dependency to non-resilient dispatch, which is probably what we should be optimizing for. We have an easy answer when someone asks why their resilient dispatch is a bit slower. We don't have easy ways to make non-resilient dispatch faster.
  - It still forces us to put every non-final method in the subtable.
  + It would be possible to share subtables between classes, e.g. when a subclass doesn't override anything from a particular parent.
  + It would allow us to put the subtables in constant memory.

John.

···

On Feb 3, 2017, at 7:12 PM, Joe Groff via swift-dev <swift-dev@swift.org> wrote:

-Joe

On Feb 2, 2017, at 6:57 PM, Andrew Trick <atrick@apple.com> wrote:



(Andrew Trick) #10

Given that most open-coded resilient method lookup paths require an extra load dependency to grab the method offset before loading the method address itself, we might possibly consider indirecting the vtables for each class, so that the top-level vtable contains [address of root class vtable, address of first child class vtable, etc.]. If a class hierarchy is fixed once exported (in other words, you can't insert a superclass into an inheritance chain without an ABI break), then the offset into each superclass's pointer in the vtable would be statically known, and the offset into each second-level vtable could be statically known via sorting by availability. This somewhat matches how we lay out protocol witness tables, where each parent protocol's witness table is indirectly referenced instead of being inlined into the leaf witness table. (OTOH, method offsets can be cached and reused to dispatch the same method on different objects, whereas we would have to perform the load chain once per object per method with this approach.)

-Joe

I want to avoid introducing dependent loads for both the resilient and non-resilient cases. obj->isa->target is already the critical path. We don’t want obj->isa->vtable->target. I’d rather go through thunks and do the lookup on the framework side.

The interesting question for me is whether to reserve any bits in the cached offset for future alternative dispatch mechanisms or table layouts. I doubt it’s worth the extra bitmask instruction that would be needed for cache lookup.

-Andy

···

On Feb 3, 2017, at 4:12 PM, Joe Groff <jgroff@apple.com> wrote:

On Feb 2, 2017, at 6:57 PM, Andrew Trick <atrick@apple.com> wrote:



(Andrew Trick) #11

Hi Karl,

This is expected behavior. The choice of method overload is based on the static types: `Super.equal(to: Super)`. The dynamic dispatch that you get when using the override keyword is based on the `self` argument’s dynamic type only. The dynamic types of the non-self arguments don’t play a part.

I’m sure there are some good language design reasons not to support dynamic method overloads. I can only tell you that implementing it would be no fun.

-Andy

···

On Feb 5, 2017, at 7:45 AM, Karl Wagner <razielim@gmail.com> wrote:

I have a question about current dispatching behaviour with protocols and ‘Self’.

protocol CustomEquatable {
    func equal(to: Self) -> Bool
}

open class Super : CustomEquatable {
    func equal(to: Super) -> Bool { print("super"); return false }
}
class Sub: Super {
    func equal(to: Sub) -> Bool { print("sub-sub"); return true }
    override func equal(to: Super) -> Bool { print("sub-super"); return true }
}

Sub().equal(to: Sub()) // sub-sub
Sub().equal(to: Super()) // sub-super
Super().equal(to: Sub()) // super
Super().equal(to: Super()) // super

(Sub() as Super).equal(to: Sub()) // sub-super — dynamically dispatched to callee’s type, not param
(Sub() as Super).equal(to: (Sub() as Super)) // sub-super — as above

Currently, we dynamically dispatch to the callee’s type to find ‘Self’, but we don’t apply that consistently when dispatching to ‘Self’-type parameters. Is that expected behaviour?

- Karl


(Joe Groff) #12

It strikes me as a bug that we accept the `func equal(to: Sub)` method quietly as an overload instead of raising an error, or at least a warning about it. Nonetheless, this is correct behavior given that the code is accepted: `equal(to: Sub)` is a different method from `equal(to: Super)` that's only visible when we statically have a `Sub`. It isn't used as the protocol witness, since the witness is chosen for `Super` and has to work for all its subclasses, and it isn't an override of `equal(to: Super)`, since it only accepts `Sub`s and not all `Super`s.

-Joe

···

On Feb 5, 2017, at 7:45 AM, Karl Wagner via swift-dev <swift-dev@swift.org> wrote:

On 3 Feb 2017, at 03:57, Andrew Trick via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:

I'm following up on a resilient dynamic dispatch discussion kicked off by
Slava during a performance team meeting to summarize some key
points on public [swift-dev].

It's easy to get sidetracked by the details of dynamic
dispatch and various ways to generate code. I suggest approaching the
problem by focusing on the ABI aspects and flexibility the ABI affords
for future optimization. I'm including a proposal for one specific
approach (#3) that wasn't discussed yet.

---
#1. (thunk export) The simplest, most flexible way to expose dispatch
across resilience boundaries is by exporting a single per-method entry
point. Future compilers could improve dispatch and gradually expose
more ABI details.

Cost: We're forced to export all those symbols in perpetuity.

[The cost of the symbols is questionable. The symbol trie should compress the
names, so the size may be small, and they should be lazily resolved,
so the startup cost should be amortized].

---
#2. (offset export) An alternative approach was proposed by JoeG a
while ago and revisited in the meeting yesterday. It involves a
client-side vtable offset lookup helper.

This allows more opportunity for micro-optimization on the client
side. This exposes the isa-based vtable mechanism as ABI. However, it
stops short of exposing the vtable layout itself. Guaranteeing vtable
dispatch may become a problem in the future because it forces an
explosion of metadata. It also has the same problem as #1 because the
framework must export a per-method symbol for the dispatch
offset. What's worse, the symbols need to be eagerly resolved (AFAIK).

---
#3. (method index) This is an alternative that I've alluded to before,
but was not discussed in yesterday's meeting. One that makes a
tradeoff between exporting symbols vs. exposing vtable layout. I want
to focus on direct cost of the ABI support and flexibility of this
approach vs. approach #1 without arguing over how to micro-optimize
various dispatching schemes. Here's how it works:

The ABI specifies a sort function for public methods that gives each
one a per-class index. Version availability takes sort precedence, so
public methods can be added without affecting other
indices. [Apparently this is the same approach we're taking with
witness tables].

As with #2 this avoids locking down the vtable format for now--in the
future we'll likely optimize it further. To avoid locking all methods
into the vtable mechanism, the offset can be tagged. The alternative
dispatch mechanism for tagged offsets will be hidden within the
class-defining framework.

This avoids the potential explosion of exported symbols--it's limited
to one per public class. It avoids explosion of metadata by allowing
alternative dispatch for some subset of methods. These tradeoffs can
be explored in the future, independent of the ABI.

---
#3a. (offset table export) A single per-class entry point provides a
pointer to an offset table. [It can be optionally cached on the client
side].

method_index = immediate
{ // common per-class method lookup
   isa = load[obj]
   isa = isa & @isa_mask
   offset = load[@class_method_table + method_index]
   if (isVtableOffset(offset))
     method_entry = load[isa + offset]
   else
     method_entry = @resolveMethodAddress(isa, @class_method_table, method_index)
}
call method_entry

Cost - client code size: Worst case 3 instructions to dispatch vs 1
instruction for approach #1. Method lookups can be combined, so groups
of calls will be more compact.

Cost - library size: the offset tables themselves need to be
materialized on the framework side. I believe this can be done
statically in read-only memory, but that needs to be verified.

ABI: The offset table format and tag bit are baked into the ABI.

---
#3b. (lazy resolution) Offset tables can be completely localized.

method_index = immediate
{ // common per-class method lookup
   isa = load[obj]
   offset = load[@local_class_method_table + method_index]
   if (!isInitializedOffset(offset)) {
     offset = @resolveMethodOffset(@class_id, method_index)
     store offset, [@local_class_method_table + method_index]
   }
   if (isVtableOffset(offset))
     method_entry = load[isa + offset]
   else
     method_entry = @resolveMethodAddress(isa, @class_id, method_index)
}
call method_entry

ABI: This avoids exposing the offset table format as ABI. All that's
needed is a symbol for the class, a single entry point for method
offset resolution, and a single entry point for non-vtable method
resolution.

Benefit: The library no longer needs to statically materialize
tables. Instead they are initialized lazily in each client module.

Cost: Lazy initialization of local tables requires an extra check and
burns some code size.
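The lazy client-side table in #3b can be modeled like this (the sentinel value and resolver signature are illustrative assumptions, not a real runtime interface):

```python
# Toy model of #3b: a client-local offset table starts uninitialized
# and is filled on first use by calling back into the defining
# framework's exported offset resolver.

UNINITIALIZED = -1

class LazyClientTable:
    def __init__(self, size, resolve_offset):
        self.offsets = [UNINITIALIZED] * size
        self.resolve_offset = resolve_offset  # exported by the framework

    def lookup(self, method_index):
        off = self.offsets[method_index]
        if off == UNINITIALIZED:
            off = self.resolve_offset(method_index)
            self.offsets[method_index] = off  # cache for later calls
        return off

calls = []
def resolver(i):
    calls.append(i)
    return i * 8  # pretend vtable offsets

t = LazyClientTable(4, resolver)
assert t.lookup(2) == 16
assert t.lookup(2) == 16
assert calls == [2]  # resolver ran once; the second lookup hit the cache
```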

---
Caveat:

This is the first time I've thought through approach #3, and it hasn't
been discussed, so there are likely a few things I'm missing at the
moment.

---
Side Note:

Regardless of the resilient dispatch mechanism, within a module the
dispatch mechanism should be implemented with thunks to avoid type
checking classes from other files and improve compile time in non-WMO
builds, as Slava requested.

-Andy
_______________________________________________
swift-dev mailing list
swift-dev@swift.org <mailto:swift-dev@swift.org>
https://lists.swift.org/mailman/listinfo/swift-dev

I have a question about current dispatching behaviour with protocols and ‘Self’.

protocol CustomEquatable {
    func equal(to: Self) -> Bool
}

open class Super : CustomEquatable {
    func equal(to: Super) -> Bool { print("super"); return false }
}
class Sub: Super {
    func equal(to: Sub) -> Bool { print("sub-sub"); return true }
    override func equal(to: Super) -> Bool { print("sub-super"); return true }
}

Sub().equal(to: Sub()) // sub-sub
Sub().equal(to: Super()) // sub-super
Super().equal(to: Sub()) // super
Super().equal(to: Super()) // super

(Sub() as Super).equal(to: Sub()) // sub-super — dynamically dispatched to callee’s type, not param
(Sub() as Super).equal(to: (Sub() as Super)) // sub-super — as above

Currently, we dynamically dispatch to the callee’s type to find ‘Self’, but we don’t apply that consistently when dispatching to ‘Self’-type parameters. Is that expected behaviour?


(Andrew Trick) #13

I was briefly over-thinking the problem… we could hoist the part that wasn’t dependent on the `isa`. That’s probably not important, and if it were, we could still hoist the cache lookup.

This option does seem to have the lowest immediate ABI cost (one symbol per class) with a tremendous amount of flexibility for optimizing both the dispatch code and the metadata representation.

-Andy

···

On Feb 3, 2017, at 1:27 PM, John McCall <rjmccall@apple.com> wrote:

On Feb 3, 2017, at 4:18 PM, Andrew Trick <atrick@apple.com <mailto:atrick@apple.com>> wrote:

On Feb 3, 2017, at 11:55 AM, Andrew Trick via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:

#3b. (lazy resolution) Offset tables can be completely localized.

method_index = immediate
{ // common per-class method lookup
isa = load[obj]
offset = load[@local_class_method_table + method_index]
if (!isInitializedOffset(offset)) {
   offset = @resolveMethodOffset(@class_id, method_index)
   store offset, [@local_class_method_table + method_index]
}
if (isVtableOffset(offset))
   method_entry = load[isa + offset]
else
   method_entry = @resolveMethodAddress(isa, @class_id, method_index)
}
call method_entry

The size of @local_class_method_table is not statically knowable.
Fortunately, it doesn't matter, because this mechanism does not actually
care about the table being contiguous; the lookup function could be
passed a per-method cache variable. This would also allow the lookup
function to be shared between classes.

Hmm... I thought the local method offset table size could be statically known because it will only be accessed for methods that were publicly available at build time, based on the sorted method index.

Sure, I mean, you can pick a table size that's sufficient for all the offsets that will be passed. That would save a few instructions at call sites, at the cost of requiring the resolvers to be per-class.

We could simplify the method resolution API with a single exported symbol per-class (maybe that's what you're getting at):

method_entry = resolveMethodAddress_ForAClass(isa, method_index, &vtable_offset)

Right, this is what I was thinking.

The problem with that is the client-side code can’t hoist and combine the method offset lookup anymore.

Why not? vtable_offset is basically an opaque cache here. Sure, technically the call isn't readnone anymore, but it's an innocuous violation like adding memoization to sin().

Or is there some finer-grained hoisting you're imagining than the entire call?

John.


(Andrew Trick) #14

I think we can generalize this discussion a bit by describing some mostly-independent axes of variation:

That's great (sorry I used up the arabic+letter naming convention earlier in the thread)...

I. A call site:
I1) inlines the lookup
I2) calls a function that performs the lookup and returns a function pointer
I3) calls a function that performs the lookup and calls the function
I3 minimizes code size at the call site, but it prevents the lookup (or components thereof) from being hoisted. I1 requires at least some details of the lookup to be ABI.

Exactly.

II. The function that performs the lookup:
II1) is emitted lazily
II2) is emitted eagerly
II2 can use static information to implement the lookup inline without affecting the ABI. II1 cannot; it must either rely on ABI assumptions or just recursively use another strategy, in which case the function is probably just a call-site code-size optimization.

Sure, but I was thinking of it as client side vs. class-definition
side. My suggestions explored the tradeoff between II1 and II2, where
the client code could optionally optimize for vtable dispatch by
emitting some lazy lookup helpers with a fall-back to the eagerly
emitted lookup.

III. The function that performs the lookup:
III1) is method-specific
III2) is class-specific
III3) is part of the language runtime
III1 requires a lot more functions, but if emitted with the class definition, it allows the lookup to be statically specialized based on the method called, so e.g. if a non-open method is never overridden, it can be resolved to a specific function pointer immediately, possibly by just aliasing a symbol to the function definition. III3 minimizes the number of functions required but doesn't allow us to evolve dispatch beyond what's supported by the current language runtime and doesn't allow class-specific optimizations (e.g. if the class definition statically knows the offset to its v-table).

III1 is the sort of optimization that can be selectively added later
if we're willing to version the ABI.

I suppose by that logic we should start with III3. We could never
ditch the runtime/compiler support, but newer code could move toward
III2 through ABI versioning.

I actually prefer to start with III2 because the static overhead is
not significantly worse than III3 and it can be optimized without ABI
migration.

IV. The function that performs the lookup:
IV1) is parameterized by an isa
IV2) is not parameterized by an isa
IV1 allows the same function to be used for super-dispatch but requires extra work to be inlined at the call site (possibly requiring a chain of resolution function calls).

In my first message I was trying to accomplish IV1. But IV2 is simpler
and I can't see a fundamental advantage to IV1. Why would it need a
lookup chain?

V. For any particular function or piece of information, it can be accessed:
V1) directly through a symbol
V2) through a class-specific table
V3) through a hierarchy-specific table (e.g. the class object)
V1 requires more global symbols, especially if the symbol is per-method, but doesn't have any index-computation problems, and it's generally a bit more efficient.
V2 allows stable assignment of fixed indexes to entries because of availability-sorting.
V3 does not; it requires some ability to (at least) slide indexes of entries because of changes elsewhere in the hierarchy.
If there are multiple instantiations of a table (perhaps with different information, like a v-table), V2 and V3 can be subject to table bloat.

I had proposed V2 as an option, but am strongly leaning toward V1 for
ABI simplicity and lower static costs (why generate vtables and offset
tables?)

So I think your alternatives were:
1. I3, II2, III1, IV2, V1 (for the dispatch function): a direct call to a per-method global function that performs the dispatch. We could apply V2 to this to decrease the number of global symbols required, but at the cost of inflating the call site and requiring a global variable whose address would have to be resolved at load time. Has an open question about super dispatch.
2. I1, V3 (for the v-table), V1 (for the global offset): a load of a per-method global variable giving an offset into the v-table. Joe's suggestion adds a helper function as a code-size optimization that follows I2, II1, III1, IV2. Again, we could also use V2 for the global offset to reduce the symbol-table costs.
3. I2, II2, III2, IV1, V2 (for the class offset / dispatch mechanism table). At least I think this is right? The difference between 3a and 3b seems to be about initialization, but maybe also shifts a lot of code-generation to the call site?

I'll pick the following option as a starting point because it constrains the ABI the least in
terms of static costs and potential directions for optimization:

"I2; (II1+II2); III2; IV1; V1"

method_entry = resolveMethodAddress_ForAClass(isa, method_index, &vtable_offset)

(where both modules would need to opt into the vtable_offset.)
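As a sketch, this resolver contract with an optional client-side vtable_offset cache could be modeled as follows (all names and the cache convention are hypothetical):

```python
# Toy model of "method_entry = resolveMethodAddress_ForAClass(isa,
# method_index, &vtable_offset)". The cache cell starts empty; once the
# framework publishes a vtable offset through it, the client dispatches
# through the vtable without calling the resolver again.

EMPTY = None

class ClassDef:
    """Framework side: owns the real vtable and the exported resolver."""
    def __init__(self, vtable):
        self.vtable = vtable

    def resolve_method(self, isa_vtable, method_index, cache):
        # The resolver may opt into vtable dispatch by filling the
        # client's cache cell; it is free to leave it EMPTY instead.
        cache[0] = method_index
        return isa_vtable[method_index]

def call_site(obj_vtable, class_def, method_index, cache):
    # Client side: the cache check can issue in parallel with the isa load.
    off = cache[0]
    if off is not EMPTY:
        return obj_vtable[off]  # fast path: pure vtable dispatch
    return class_def.resolve_method(obj_vtable, method_index, cache)

cls = ClassDef(vtable=["m0_impl", "m1_impl"])
cache = [EMPTY]
assert call_site(cls.vtable, cls, 1, cache) == "m1_impl"  # slow path, fills cache
assert cache[0] == 1
assert call_site(cls.vtable, cls, 1, cache) == "m1_impl"  # fast path
```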

I think any alternative would need to be demonstrably better in terms of code size or dynamic dispatch cost.

-Andy

···

On Feb 3, 2017, at 3:12 PM, John McCall <rjmccall@apple.com> wrote:


(John McCall) #15

I think we can generalize this discussion a bit by describing some mostly-independent axes of variation:

That's great (sorry I used up the arabic+letter naming convention earlier in the thread)...

I. A call site:
I1) inlines the lookup
I2) calls a function that performs the lookup and returns a function pointer
I3) calls a function that performs the lookup and calls the function
I3 minimizes code size at the call site, but it prevents the lookup (or components thereof) from being hoisted. I1 requires at least some details of the lookup to be ABI.

Exactly.

II. The function that performs the lookup:
II1) is emitted lazily
II2) is emitted eagerly
II2 can use static information to implement the lookup inline without affecting the ABI. II1 cannot; it must either rely on ABI assumptions or just recursively use another strategy, in which case the function is probably just a call-site code-size optimization.

Sure, but I was thinking of it as client side vs. class-definition
side. My suggestions explored the tradeoff between II1 and II2, where
the client code could optionally optimize for vtable dispatch by
emitting some lazy lookup helpers with a fall-back to the eagerly
emitted lookup.

That's how I see it, too.

III. The function that performs the lookup:
III1) is method-specific
III2) is class-specific
III3) is part of the language runtime
III1 requires a lot more functions, but if emitted with the class definition, it allows the lookup to be statically specialized based on the method called, so e.g. if a non-open method is never overridden, it can be resolved to a specific function pointer immediately, possibly by just aliasing a symbol to the function definition. III3 minimizes the number of functions required but doesn't allow us to evolve dispatch beyond what's supported by the current language runtime and doesn't allow class-specific optimizations (e.g. if the class definition statically knows the offset to its v-table).

III1 is the sort of optimization that can be selectively added later
if we're willing to version the ABI.

Well, at the cost of leaving us with a legacy exported function.

I suppose by that logic we should start with III3. We could never
ditch the runtime/compiler support, but newer code could move toward
III2 through ABI versioning.

It would depend on whether we build features on top of this. For example,
I could see a reflection system assuming that all methods could be dispatched
via the global runtime function, if we had one.

I actually prefer to start with III2 because the static overhead is
not significantly worse than III3 and it can be optimized without ABI
migration.

If it's between III2 and III3, yeah, I think I agree.

IV. The function that performs the lookup:
IV1) is parameterized by an isa
IV2) is not parameterized by an isa
IV1 allows the same function to be used for super-dispatch but requires extra work to be inlined at the call site (possibly requiring a chain of resolution function calls).

In my first message I was trying to accomplish IV1. But IV2 is simpler
and I can't see a fundamental advantage to IV1.

Well, you can use IV1 to implement super dispatch (+ sibling dispatch, if we add it)
by passing in the isa of either the superclass or the current class. IV2 means
that the dispatch function is always based on the isa from the object, so those
dispatch schemes need something else to implement them.

Why would it need a lookup chain?

Code size, because you might not want to inline the isa load at every call site.
So, for a normal dispatch, you'd have an IV2 function (defined client-side?)
that just loads the isa and calls the IV1 function (defined by the class).

V. For any particular function or piece of information, it can be accessed:
V1) directly through a symbol
V2) through a class-specific table
V3) through a hierarchy-specific table (e.g. the class object)
V1 requires more global symbols, especially if the symbol is per-method, but doesn't have any index-computation problems, and it's generally a bit more efficient.
V2 allows stable assignment of fixed indexes to entries because of availability-sorting.
V3 does not; it requires some ability to (at least) slide indexes of entries because of changes elsewhere in the hierarchy.
If there are multiple instantiations of a table (perhaps with different information, like a v-table), V2 and V3 can be subject to table bloat.

I had proposed V2 as an option, but am strongly leaning toward V1 for
ABI simplicity and lower static costs (why generate vtables and offset
tables?)

V1 doesn't remove the need for tables, it just hides them from the ABI.

So I think your alternatives were:
1. I3, II2, III1, IV2, V1 (for the dispatch function): a direct call to a per-method global function that performs the dispatch. We could apply V2 to this to decrease the number of global symbols required, but at the cost of inflating the call site and requiring a global variable whose address would have to be resolved at load time. Has an open question about super dispatch.
2. I1, V3 (for the v-table), V1 (for the global offset): a load of a per-method global variable giving an offset into the v-table. Joe's suggestion adds a helper function as a code-size optimization that follows I2, II1, III1, IV2. Again, we could also use V2 for the global offset to reduce the symbol-table costs.
3. I2, II2, III2, IV1, V2 (for the class offset / dispatch mechanism table). At least I think this is right? The difference between 3a and 3b seems to be about initialization, but maybe also shifts a lot of code-generation to the call site?

I'll pick the following option as a starting point because it constrains the ABI the least in
terms of static costs and potential directions for optimization:

"I2; (II1+II2); III2; IV1; V1"

method_entry = resolveMethodAddress_ForAClass(isa, method_index, &vtable_offset)

(where both modules would need to opt into the vtable_offset.)

Wait, remind me what this &vtable_offset is for at this point? Is it basically just a client-side cache? I can't figure out what it's doing for us.

I think any alternative would need to be demonstrably better in terms of code size or dynamic dispatch cost.

That's a lot of stuff to materialize at every call site. It makes calls
into something like a 10 instruction sequence on ARM64, ignoring
the actual formal arguments:

  %raw_isa = load %object // 1 instruction
  %isa_mask = load @swift_isaMask // 3: 2 to materialize address from GOT (not necessarily within ±1MB), 1 to load from it
  %isa = and %raw_isa, %isa_mask // 1
  %method_index = 13 // 1
  %cache = @local.A.foo.cache // 2: not necessarily within ±1MB
  %method = call @A.resolveMethod(%isa, %method_index, %cache) // 1
  call %method(...) // 1

On x86-64, it'd just be 8 instructions because the immediate range for leaq/movq
is ±2GB, which is Good Enough for the standard code model, but of course it still
expands to roughly the same amount of code.

Even without vtable_offset, it's a lot of code to inline.

So we'd almost certainly want a client-side resolver function that handled
the normal case. Is that what you mean when you say II1+II2? So the local
resolver would be I2; II1; III2; IV2; V1, which leaves us with a three-instruction
call sequence, which I think is equivalent to Objective-C, and that function
would do this sequence:

define @local_resolveMethodAddress(%object, %method_index)
  %raw_isa = load %object // 1 instruction
  %isa_mask = load @swift_isaMask // 3: 2 to materialize address from GOT (not necessarily within ±1MB), 1 to load from it
  %isa = and %raw_isa, %isa_mask // 1
  %cache_table = @local.A.cache_table // 2: not necessarily within ±1MB
  %cache = add %cache_table, %method_index * 8 // 1
  tailcall @A.resolveMethod(%isa, %method_index, %cache) // 1

John.

···

On Feb 3, 2017, at 7:06 PM, Andrew Trick <atrick@apple.com> wrote:

On Feb 3, 2017, at 3:12 PM, John McCall <rjmccall@apple.com <mailto:rjmccall@apple.com>> wrote:


(Dave Abrahams) #16

Does that mean we don't need to make such an assumption for protocols
either? That would be a big relief for me, if so.

···

on Fri Feb 03 2017, John McCall <swift-dev-AT-swift.org> wrote:

On Feb 3, 2017, at 7:12 PM, Joe Groff via swift-dev <swift-dev@swift.org> wrote:
Given that most open-coded resilient method lookup paths require an
extra load dependency to grab the method offset before loading the
method address itself, we might possibly consider indirecting the
vtables for each class, so that the top-level vtable contains
[address of root class vtable, address of first child class vtable,
etc.]. If a class hierarchy is fixed once exported (in other words,
you can't insert a superclass into an inheritance chain without an
ABI break), then the offset into each superclass's pointer in the
vtable would be statically known, and the offset into each
second-level vtable could be statically known via sorting by
availability. This somewhat matches how we lay out protocol witness
tables, where each parent protocol's witness table is indirectly
referenced instead of being inlined into the leaf witness
table. (OTOH, method offsets can be cached and reused to dispatch
the same method on different objects, whereas we would have to
perform the load chain once per object per method with this
approach.)

Great point.

I'm still uncomfortable with the idea of assuming that we can't insert
a superclass into an inheritance chain. This isn't an assumption
that's otherwise necessary or even useful, unless we decide to start
optimizing dynamic casts.

--
-Dave


(Joe Groff) #17

Given that most open-coded resilient method lookup paths require an extra load dependency to grab the method offset before loading the method address itself, we might possibly consider indirecting the vtables for each class, so that the top-level vtable contains [address of root class vtable, address of first child class vtable, etc.]. If a class hierarchy is fixed once exported (in other words, you can't insert a superclass into an inheritance chain without an ABI break), then the offset into each superclass's pointer in the vtable would be statically known, and the offset into each second-level vtable could be statically known via sorting by availability. This somewhat matches how we lay out protocol witness tables, where each parent protocol's witness table is indirectly referenced instead of being inlined into the leaf witness table. (OTOH, method offsets can be cached and reused to dispatch the same method on different objects, whereas we would have to perform the load chain once per object per method with this approach.)

Great point.

I'm still uncomfortable with the idea of assuming that we can't insert a superclass into an inheritance chain. This isn't an assumption that's otherwise necessary or even useful, unless we decide to start optimizing dynamic casts.

Fair point. Jordan also noted that inserting superclasses is something that has happened in practice with Apple's frameworks in the past.

Assuming it's valid, some additional trade-offs that come to mind:
- It adds a load dependency to non-resilient dispatch, which is probably what we should be optimizing for. We have an easy answer when someone asks why their resilient dispatch is a bit slower. We don't have easy ways to make non-resilient dispatch faster.

If we have control over how the class object grows in both directions, we could potentially mitigate the fragile dispatch case, by having the subtable pointers grow in one direction, pointing into sub-vtables that are stored inline in the other direction, something like this:

so that code with full knowledge of the class hierarchy can directly address methods at static offsets, and only resilient clients need to use the subtable pointers.
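One way to picture the bidirectional layout (this is a toy reconstruction of the idea, not an actual layout proposal): model the class object as a map from signed slot index to entry, with subtable pointers growing downward and inline sub-vtables growing upward.

```python
# Toy sketch of the bidirectional class-object layout: negative slots
# hold one subtable pointer per class in the inheritance chain, while
# method entries sit inline at non-negative slots.

def build_class_object(chain):
    """chain: list of (class_name, [methods]) from root to leaf."""
    obj = {}
    pos = 0
    for depth, (_name, methods) in enumerate(chain):
        obj[-(depth + 1)] = pos          # subtable pointer, grows downward
        for m in methods:                # inline sub-vtable, grows upward
            obj[pos] = m
            pos += 1
    return obj

obj = build_class_object([("Root", ["r0", "r1"]), ("Child", ["c0"])])

# Fragile client: knows the whole hierarchy, addresses methods directly.
assert obj[2] == "c0"
# Resilient client: indirects through Child's subtable pointer (depth 1).
assert obj[obj[-2] + 0] == "c0"
```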

-Joe

···

On Feb 3, 2017, at 8:47 PM, John McCall <rjmccall@apple.com> wrote:

On Feb 3, 2017, at 7:12 PM, Joe Groff via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:


(Andrew Trick) #18

IV. The function that performs the lookup:
IV1) is parameterized by an isa
IV2) is not parameterized by an isa
IV1 allows the same function to be used for super-dispatch but requires extra work to be inlined at the call site (possibly requiring a chain of resolution function calls).

In my first message I was trying to accomplish IV1. But IV2 is simpler
and I can't see a fundamental advantage to IV1.

Well, you can use IV1 to implement super dispatch (+ sibling dispatch, if we add it)
by passing in the isa of either the superclass or the current class. IV2 means
that the dispatch function is always based on the isa from the object, so those
dispatch schemes need something else to implement them.

Why would it need a lookup chain?

Code size, because you might not want to inline the isa load at every call site.
So, for a normal dispatch, you'd have an IV2 function (defined client-side?)
that just loads the isa and calls the IV1 function (defined by the class).

Right. Looks like I wrote the opposite of what I meant. The important thing to me is that the vtable offset load + check is issued in parallel with the isa load. I was originally pushing IV2 for this reason, but now think that optimization could be entirely lazy via a client-side cache.

V. For any particular function or piece of information, it can be accessed:
V1) directly through a symbol
V2) through a class-specific table
V3) through a hierarchy-specific table (e.g. the class object)
V1 requires more global symbols, especially if the symbol is per-method, but doesn't have any index-computation problems, and it's generally a bit more efficient.
V2 allows stable assignment of fixed indexes to entries because of availability-sorting.
V3 does not; it requires some ability to (at least) slide indexes of entries because of changes elsewhere in the hierarchy.
If there are multiple instantiations of a table (perhaps with different information, like a v-table), V2 and V3 can be subject to table bloat.

I had proposed V2 as an option, but am strongly leaning toward V1 for
ABI simplicity and lower static costs (why generate vtables and offset
tables?)

V1 doesn't remove the need for tables, it just hides them from the ABI.

I like that it makes the offset tables lazy and optional. They don’t even need to be complete.

So I think your alternatives were:
1. I3, II2, III1, IV2, V1 (for the dispatch function): a direct call to a per-method global function that performs the dispatch. We could apply V2 to this to decrease the number of global symbols required, but at the cost of inflating the call site and requiring a global variable whose address would have to be resolved at load time. Has an open question about super dispatch.
2. I1, V3 (for the v-table), V1 (for the global offset): a load of a per-method global variable giving an offset into the v-table. Joe's suggestion adds a helper function as a code-size optimization that follows I2, II1, III1, IV2. Again, we could also use V2 for the global offset to reduce the symbol-table costs.
3. I2, II2, III2, IV1, V2 (for the class offset / dispatch mechanism table). At least I think this is right? The difference between 3a and 3b seems to be about initialization, but maybe also shifts a lot of code-generation to the call site?

I'll pick the following option as a starting point because it constrains the ABI the least in
terms of static costs and potential directions for optimization:

"I2; (II1+II2); III2; IV1; V1"

method_entry = resolveMethodAddress_ForAClass(isa, method_index, &vtable_offset)

(where both modules would need to opt into the vtable_offset.)

Wait, remind me what this &vtable_offset is for at this point? Is it basically just a client-side cache? I can't figure out what it's doing for us.

It’s a client-side cache that can be checked in parallel with the `isa` load. The resolver is not required to provide an offset, and the client does not need to cache all the method offsets. It does burn an extra register, but gains the ability to implement vtable dispatch entirely on the client side.

You might be thinking of caching the method entry itself and checking `isa` within `resolveMethod`. I didn’t mention that possibility because the cost of calling the non-local `resolveMethod` function followed by an indirect call largely defeats the purpose of something like an inline-cache.

I think any alternative would need to be demonstrably better in terms of code size or dynamic dispatch cost.

That's a lot of stuff to materialize at every call site. It makes calls
into something like a 10 instruction sequence on ARM64, ignoring
the actual formal arguments:

  %raw_isa = load %object // 1 instruction
  %isa_mask = load @swift_isaMask // 3: 2 to materialize address from GOT (not necessarily within ±1MB), 1 to load from it
  %isa = and %raw_isa, %isa_mask // 1
  %method_index = 13 // 1
  %cache = @local.A.foo.cache // 2: not necessarily within ±1MB
  %method = call @A.resolveMethod(%isa, %method_index, %cache) // 1
  call %method(...) // 1

On x86-64, it'd just be 8 instructions because the immediate range for leaq/movq
is ±2GB, which is Good Enough for the standard code model, but of course it still
expands to roughly the same amount of code.

Even without vtable_offset, it's a lot of code to inline.

So we'd almost certainly want a client-side resolver function that handled
the normal case. Is that what you mean when you say II1+II2? So the local
resolver would be I2; II1; III2; IV2; V1, which leaves us with a three-instruction
call sequence, which I think is equivalent to Objective-C, and that function
would do this sequence:

define @local_resolveMethodAddress(%object, %method_index)
  %raw_isa = load %object // 1 instruction
  %isa_mask = load @swift_isaMask // 3: 2 to materialize address from GOT (not necessarily within ±1MB), 1 to load from it
  %isa = and %raw_isa, %isa_mask // 1
  %cache_table = @local.A.cache_table // 2: not necessarily within ±1MB
  %cache = add %cache_table, %method_index * 8 // 1
  tailcall @A.resolveMethod(%isa, %method_index, %cache) // 1

John.

Yes, exactly, except we haven’t even done any client-side vtable optimization yet.

To me the point of the local cache is to avoid calling @A.resolveMethod in the common case. So we need another load-compare-and-branch, which makes the local helper 12-13 instructions. Then you have the vtable load itself, so that’s 13-14 instructions. You would be saving on dynamic instructions but paying with 4 extra static instructions per class.

It would be lame if we can't force @local.A.cache_table to be ±1MB relative to the helper.

Inlining the cache table address might be worthwhile because %method_index would then be an immediate and hoisted to the top of the function.

-Andy

···

On Feb 3, 2017, at 9:37 PM, John McCall <rjmccall@apple.com> wrote:


(John McCall) #19

Oh, that's a cute idea.

John.

···

On Feb 6, 2017, at 1:23 PM, Joe Groff <jgroff@apple.com> wrote:

On Feb 3, 2017, at 8:47 PM, John McCall <rjmccall@apple.com <mailto:rjmccall@apple.com>> wrote:

On Feb 3, 2017, at 7:12 PM, Joe Groff via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:
Given that most open-coded resilient method lookup paths require an extra load dependency to grab the method offset before loading the method address itself, we might possibly consider indirecting the vtables for each class, so that the top-level vtable contains [address of root class vtable, address of first child class vtable, etc.]. If a class hierarchy is fixed once exported (in other words, you can't insert a superclass into an inheritance chain without an ABI break), then the offset into each superclass's pointer in the vtable would be statically known, and the offset into each second-level vtable could be statically known via sorting by availability. This somewhat matches how we lay out protocol witness tables, where each parent protocol's witness table is indirectly referenced instead of being inlined into the leaf witness table. (OTOH, method offsets can be cached and reused to dispatch the same method on different objects, whereas we would have to perform the load chain once per object per method with this approach.)

Great point.

I'm still uncomfortable with the idea of assuming that we can't insert a superclass into an inheritance chain. This isn't an assumption that's otherwise necessary or even useful, unless we decide to start optimizing dynamic casts.

Fair point. Jordan also noted that inserting superclasses is something that has happened in practice with Apple's frameworks in the past.

Assuming it's valid, some additional trade-offs that come to mind:
- It adds a load dependency to non-resilient dispatch, which is probably what we should be optimizing for. We have an easy answer when someone asks why their resilient dispatch is a bit slower. We don't have easy ways to make non-resilient dispatch faster.

If we have control over how the class object grows in both directions, we could potentially mitigate the fragile dispatch case, by having the subtable pointers grow in one direction, pointing into sub-vtables that are stored inline in the other direction, something like this:

so that code with full knowledge of the class hierarchy can directly address methods at static offsets, and only resilient clients need to use the subtable pointers.


(John McCall) #20

I think we can probably implement the ability to extract a super-protocol resiliently, yeah.
You'd get some requirements in redundant positions, but that's not so bad.

John.

···

On Feb 4, 2017, at 6:45 PM, Dave Abrahams via swift-dev <swift-dev@swift.org> wrote:
on Fri Feb 03 2017, John McCall <swift-dev-AT-swift.org> wrote:

On Feb 3, 2017, at 7:12 PM, Joe Groff via swift-dev <swift-dev@swift.org> wrote:
Given that most open-coded resilient method lookup paths require an
extra load dependency to grab the method offset before loading the
method address itself, we might possibly consider indirecting the
vtables for each class, so that the top-level vtable contains

[address of root class vtable, address of first child class vtable,
etc.]. If a class hierarchy is fixed once exported (in other words,
you can't insert a superclass into an inheritance chain without an
ABI break), then the offset into each superclass's pointer in the
vtable would be statically known, and the offset into each
second-level vtable could be statically known via sorting by
availability. This somewhat matches how we lay out protocol witness
tables, where each parent protocol's witness table is indirectly
referenced instead of being inlined into the leaf witness
table. (OTOH, method offsets can be cached and reused to dispatch
the same method on different objects, whereas we would have to
perform the load chain once per object per method with this
approach.)

Great point.

I'm still uncomfortable with the idea of assuming that we can't insert
a superclass into an inheritance chain. This isn't an assumption
that's otherwise necessary or even useful, unless we decide to start
optimizing dynamic casts.

Does that mean we don't need to make such an assumption for protocols
either? That would be a big relief for me, if so.