I think we can generalize this discussion a bit by describing some mostly-independent axes of variation:
That's great (sorry, I used up the Arabic-numeral+letter naming convention earlier in the thread)...
I. A call site:
I1) inlines the lookup
I2) calls a function that performs the lookup and returns a function pointer
I3) calls a function that performs the lookup and calls the function
I3 minimizes code size at the call site, but it prevents the lookup (or components thereof) from being hoisted. I1 requires at least some details of the lookup to be ABI.
Exactly.
II. The function that performs the lookup:
II1) is emitted lazily
II2) is emitted eagerly
II2 can use static information to implement the lookup inline without affecting the ABI. II1 cannot; it must either rely on ABI assumptions or just recursively use another strategy, in which case the function is probably just a call-site code-size optimization.
Sure, but I was thinking of it as client side vs. class-definition
side. My suggestions explored the tradeoff between II1 and II2, where
the client code could optionally optimize for vtable dispatch by
emitting some lazy lookup helpers with a fall-back to the eagerly
emitted lookup.
That's how I see it, too.
III. The function that performs the lookup:
III1) is method-specific
III2) is class-specific
III3) is part of the language runtime
III1 requires a lot more functions, but if emitted with the class definition, it allows the lookup to be statically specialized based on the method called, so e.g. if a non-open method is never overridden, it can be resolved to a specific function pointer immediately, possibly by just aliasing a symbol to the function definition. III3 minimizes the number of functions required but doesn't allow us to evolve dispatch beyond what's supported by the current language runtime and doesn't allow class-specific optimizations (e.g. if the class definition statically knows the offset to its v-table).
III1 is the sort of optimization that can be selectively added later
if we're willing to version the ABI.
Well, at the cost of leaving us with a legacy exported function.
I suppose by that logic we should start with III3. We could never
ditch the runtime/compiler support, but newer code could move toward
III2 through ABI versioning.
It would depend on whether we build features on top of this. For example,
I could see a reflection system assuming that all methods could be dispatched
via the global runtime function, if we had one.
I actually prefer to start with III2 because the static overhead is
not significantly worse than III3 and it can be optimized without ABI
migration.
If it's between III2 and III3, yeah, I think I agree.
IV. The function that performs the lookup:
IV1) is parameterized by an isa
IV2) is not parameterized by an isa
IV1 allows the same function to be used for super-dispatch but requires extra work to be inlined at the call site (possibly requiring a chain of resolution function calls).
In my first message I was trying to accomplish IV1. But IV2 is simpler
and I can't see a fundamental advantage to IV1.
Well, you can use IV1 to implement super dispatch (+ sibling dispatch, if we add it)
by passing in the isa of either the superclass or the current class. IV2 means
that the dispatch function is always based on the isa from the object, so those
dispatch schemes need something else to implement them.
Why would it need a lookup chain?
Code size, because you might not want to inline the isa load at every call site.
So, for a normal dispatch, you'd have an IV2 function (defined client-side?)
that just loads the isa and calls the IV1 function (defined by the class).
V. For any particular function or piece of information, it can be accessed:
V1) directly through a symbol
V2) through a class-specific table
V3) through a hierarchy-specific table (e.g. the class object)
V1 requires more global symbols, especially if the symbol is per-method, but doesn't have any index-computation problems, and it's generally a bit more efficient.
V2 allows stable assignment of fixed indexes to entries because of availability-sorting.
V3 does not; it requires some ability to (at least) slide indexes of entries because of changes elsewhere in the hierarchy.
If there are multiple instantiations of a table (perhaps with different information, like a v-table), V2 and V3 can be subject to table bloat.
I had proposed V2 as an option, but am strongly leaning toward V1 for
ABI simplicity and lower static costs (why generate vtables and offset
tables?)
V1 doesn't remove the need for tables, it just hides them from the ABI.
So I think your alternatives were:
1. I3, II2, III1, IV2, V1 (for the dispatch function): a direct call to a per-method global function that performs the dispatch. We could apply V2 to this to decrease the number of global symbols required, but at the cost of inflating the call site and requiring a global variable whose address would have to be resolved at load time. Has an open question about super dispatch.
2. I1, V3 (for the v-table), V1 (for the global offset): a load of a per-method global variable giving an offset into the v-table. Joe's suggestion adds a helper function as a code-size optimization that follows I2, II1, III1, IV2. Again, we could also use V2 for the global offset to reduce the symbol-table costs.
3. I2, II2, III2, IV1, V2 (for the class offset / dispatch mechanism table). At least I think this is right? The difference between 3a and 3b seems to be about initialization, but maybe also shifts a lot of code-generation to the call site?
I'll pick the following option as a starting point because it constrains the ABI the least in
terms of static costs and potential directions for optimization:
"I2; (II1+II2); III2; IV1; V1"
method_entry = resolveMethodAddress_ForAClass(isa, method_index, &vtable_offset)
(where both modules would need to opt into the vtable_offset.)
Wait, remind me what this &vtable_offset is for at this point? Is it basically just a client-side cache? I can't figure out what it's doing for us.
I think any alternative would need to be demonstrably better in terms of code size or dynamic dispatch cost.
That's a lot of stuff to materialize at every call site. It makes calls
into something like a 10 instruction sequence on ARM64, ignoring
the actual formal arguments:
%raw_isa = load %object // 1 instruction
%isa_mask = load @swift_isaMask // 3: 2 to materialize address from GOT (not necessarily within ±1MB), 1 to load from it
%isa = and %raw_isa, %isa_mask // 1
%method_index = 13 // 1
%cache = @local.A.foo.cache // 2: not necessarily within ±1MB
%method = call @A.resolveMethod(%isa, %method_index, %cache) // 1
call %method(...) // 1
On x86-64, it'd just be 8 instructions because the immediate range for leaq/movq
is ±2GB, which is Good Enough for the standard code model, but of course it still
expands to roughly the same amount of code.
Even without vtable_offset, it's a lot of code to inline.
So we'd almost certainly want a client-side resolver function that handled
the normal case. Is that what you mean when you say II1+II2? So the local
resolver would be I2; II1; III2; IV2; V1, which leaves us with a three-instruction
call sequence, which I think is equivalent to Objective-C, and that function
would do this sequence:
define @local_resolveMethodAddress(%object, %method_index)
%raw_isa = load %object // 1 instruction
%isa_mask = load @swift_isaMask // 3: 2 to materialize address from GOT (not necessarily within ±1MB), 1 to load from it
%isa = and %raw_isa, %isa_mask // 1
%cache_table = @local.A.cache_table // 2: not necessarily within ±1MB
%cache = add %cache_table, %method_index * 8 // 1
tailcall @A.resolveMethod(%isa, %method_index, %cache) // 1
John.
On Feb 3, 2017, at 7:06 PM, Andrew Trick <atrick@apple.com> wrote:
On Feb 3, 2017, at 3:12 PM, John McCall <rjmccall@apple.com> wrote: