Adding a "unique" flag to alloc_ref

zoecarver · July 15, 2020, 2:23am

@Andrew_Trick that was a fantastic reply. Thank you.

First, we do need some clear motivating examples to discuss.

Good idea.

But it isn't clear which ones are solved by the current PR:

Well... none of them. This only implements the flag; it doesn't add a pass to mark references unique or update any existing passes to use that information.

To clear up which PRs/commits do what:

This is an old (and closed) PR that promoted stack-allocated references to have static access enforcement and "stack" memory kind. But, it shows some possible performance improvements that would be nice to have if we can achieve them in other ways.

github.com/apple/swift

[opt] Update alloc_refs on the stack to have stack memory kind.

apple:master ← zoecarver:opt/static-access-marker

opened 10:48PM - 29 Apr 20 UTC

zoecarver

+157 -4

Update MemAccessUtils to classify class allocs on the stack as "AccessedStorage::Kind::Stack" instead of "::Class". This allows class element access markers to be promoted to have static access enforcement. Right now we're really bad at optimizing classes. I've made a few patches recently to address some of the issues I've seen in the optimizer around class optimization. They're mostly similar patches (or at least addressing similar problems) and over time they seem to have gotten more and more general. I think this may finally be the _right_ patch. Fixing this edge case around access enforcement is (hopefully, we'll see what the benchmarks say) a big performance improvement for a relatively small patch. And, more importantly, this feels like the "correct" way to address the problem (rather than storing class members in tuples =P).

This PR adds the "unique" flag to alloc_ref. That's it.

The next PR just shows a benchmark of what happens when we remove access markers in the middle of the pass pipeline. It has the same improvements as the first PR but with some added regressions... curious. I think DSE/RLE is the reason for the improvement when accesses are removed. I'm not sure why the first PR doesn't have the regressions.

github.com/apple/swift

Comment by swift-ci to [NO-MERGE] Add access enforcement opts and access marker elimination to addHighLevelModulePipeline.

apple:master ← zoecarver:tmp/opt/access-opts-and-elim

### Performance: -O

| **Improvement** | **OLD** | **NEW** | **DELTA** | **RATIO** |
| :--- | ---: | ---: | ---: | ---: |
| ObjectiveCBridgeStubToNSDateRef | 3880 | 3440 | -11.3% | **1.13x (?)** |
| DictionarySubscriptDefaultMutationOfObjects | 1680 | 1500 | -10.7% | **1.12x (?)** |
| StringToDataMedium | 3900 | 3500 | -10.3% | **1.11x (?)** |

### Code size: -O

| **Improvement** | **OLD** | **NEW** | **DELTA** | **RATIO** |
| :--- | ---: | ---: | ---: | ---: |
| ArrayOfRef.o | 8907 | 8795 | -1.3% | **1.01x** |

### Performance: -Osize

| **Regression** | **OLD** | **NEW** | **DELTA** | **RATIO** |
| :--- | ---: | ---: | ---: | ---: |
| MapReduceLazyCollectionShort | 40 | 85 | +112.5% | **0.47x** |
| FlattenListFlatMap | 4185 | 5941 | +42.0% | **0.70x (?)** |
| PointerArithmetics | 31400 | 37100 | +18.2% | **0.85x** |
| BinaryFloatingPointPropertiesBinade | 31 | 34 | +9.7% | **0.91x (?)** |
| ProtocolDispatch | 314 | 342 | +8.9% | **0.92x** |

| **Improvement** | **OLD** | **NEW** | **DELTA** | **RATIO** |
| :--- | ---: | ---: | ---: | ---: |
| ClassArrayGetter2 | 1830 | 130 | -92.9% | **14.08x** |
| MapReduceClass2 | 198 | 25 | -87.4% | **7.92x** |
| MapReduceClassShort2 | 362 | 200 | -44.8% | **1.81x** |
| BinaryFloatingPointPropertiesNextUp | 35 | 31 | -11.4% | **1.13x (?)** |
| MapReduce | 179 | 160 | -10.6% | **1.12x (?)** |
| DictionarySubscriptDefaultMutationOfObjects | 1720 | 1540 | -10.5% | **1.12x (?)** |

### Code size: -Osize

| **Improvement** | **OLD** | **NEW** | **DELTA** | **RATIO** |
| :--- | ---: | ---: | ---: | ---: |
| ArrayOfGenericRef.o | 8304 | 8032 | -3.3% | **1.03x** |
| ArrayOfRef.o | 8581 | 8309 | -3.2% | **1.03x** |
| DictionaryBridge.o | 3309 | 3221 | -2.7% | **1.03x** |
| LinkedList.o | 2257 | 2200 | -2.5% | **1.03x** |
| OpaqueConsumingUsers.o | 2351 | 2303 | -2.0% | **1.02x** |
| RGBHistogram.o | 19552 | 19200 | -1.8% | **1.02x** |
| ObjectAllocation.o | 4298 | 4242 | -1.3% | **1.01x** |

### Performance: -Onone

| **Regression** | **OLD** | **NEW** | **DELTA** | **RATIO** |
| :--- | ---: | ---: | ---: | ---: |
| DataReplaceSmallBuffer | 10400 | 11700 | +12.5% | **0.89x (?)** |
| DictionaryBridgeToObjC_Access | 857 | 958 | +11.8% | **0.89x (?)** |
| ObjectiveCBridgeStubNSDateRefAccess | 5345 | 5790 | +8.3% | **0.92x (?)** |

| **Improvement** | **OLD** | **NEW** | **DELTA** | **RATIO** |
| :--- | ---: | ---: | ---: | ---: |
| ObjectiveCBridgeStubToNSDate2 | 680 | 620 | -8.8% | **1.10x (?)** |
| Data.hash.Small | 348 | 324 | -6.9% | **1.07x (?)** |

### Code size: -swiftlibs

**How to read the data**

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%. If you see any unexpected regressions, you should consider fixing the regressions before you merge the PR.

**Noise**: Sometimes the performance results (not code size!) contain false alarms. Unexpected regressions which are marked with '(?)' are probably noise. If you see regressions which you cannot explain you can try to run the benchmarks again. If regressions still show up, please consult with the performance team (@eeckstein).

**Hardware Overview**

- Model Name: Mac Pro
- Model Identifier: MacPro6,1
- Processor Name: 12-Core Intel Xeon E5
- Processor Speed: 2.7 GHz
- Number of Processors: 1
- Total Number of Cores: 12
- L2 Cache (per Core): 256 KB
- L3 Cache: 30 MB
- Memory: 64 GB
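For anyone reading the benchmark numbers: DELTA and RATIO appear to be derived from OLD and NEW as a percentage change and as OLD divided by NEW. That reading is an assumption, but it is consistent with rows such as ClassArrayGetter2 and MapReduceLazyCollectionShort. A minimal C++ sketch (the function names are made up for illustration):

```cpp
#include <cassert>

// Percentage change from OLD to NEW. Negative means the new build is
// faster (or smaller); positive means a regression.
double deltaPercent(double oldValue, double newValue) {
  assert(oldValue != 0.0 && "OLD must be non-zero");
  return (newValue - oldValue) / oldValue * 100.0;
}

// OLD / NEW. Above 1.0 is an improvement, below 1.0 is a regression.
double ratio(double oldValue, double newValue) {
  assert(newValue != 0.0 && "NEW must be non-zero");
  return oldValue / newValue;
}
```

For ClassArrayGetter2 at -Osize (OLD 1830, NEW 130) this gives roughly -92.9% and 14.08x, and for MapReduceLazyCollectionShort (OLD 40, NEW 85) roughly +112.5% and 0.47x.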

And last, here's the most recent PR I created. This PR will give us a benchmark of some baseline improvements we will (hopefully) get by incorporating the "unique" information into various access optimization passes. The goal of the PR is to get a benchmark, not to commit anything. So there's a lot going on.

github.com/apple/swift

[WIP] [NO-MERGE] Make "unique" flag more robust and update access enforcement opts to make use of it.

apple:master ← zoecarver:opt/unique-reference-access

opened 07:18PM - 14 Jul 20 UTC

zoecarver

+689 -94

This is a WIP PR to get a benchmark of the performance improvements of access enforcement opts when it promotes uniquely identified storage accesses to statically enforced accesses. I'm not sure to what degree `AccessedStorage::isUniquelyIdentified` is used throughout the optimizer. I wouldn't be surprised if this PR doesn't have any meaningful performance gains. I think another pass that just updated accesses of "unique" references might be much more beneficial (performance-wise). We'll see. Later, I'll clean up these individual commits and break each one into its own PR. The tests do not currently pass; this PR is just for benchmarking. Refs (and based on) #32844. [Related forum post](https://forums.swift.org/t/adding-a-unique-flag-to-alloc-ref/38398).

I suspect there are some decent motivating examples, so assuming you can uncover those

Yes, I can work to uncover these. The above PR will hopefully show the performance improvements from updating AccessedStorage::isUniquelyIdentified. I suspect we could get even better results with a custom pass, though. As you said, there are also other improvements here beyond performance, i.e., static access errors instead of runtime errors.

Structural SIL properties should be computed on-the-fly if possible as long as that doesn't increase algorithmic complexity. For example, if discovering a property only requires walking the use-def chain without handling phis, then that should always be done on-the-fly. The proposed "unique" property requires analyzing all uses of the alloc_ref. That certainly could be done on the fly, but I have to admit, caching this in a flag will be cheaper, and I do want identifying AccessedStorage to be as cheap as possible--it's already used within quadratic algorithms.

To be sure a reference is actually unique, I think we have to use a quadratic algorithm: we have to evaluate every use, every use of that use, and so on. So I think the performance issue is quite an important one to keep in mind.
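As a rough illustration of "every use and every use of that use," here is a toy worklist walk in C++. It is a sketch under stated assumptions, not the code from the PR: `Value`, `UseKind`, and `isUniqueReference` are hypothetical stand-ins rather than the real SIL API, and which uses count as "escaping" is a simplification.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified classification of how a value is used.
enum class UseKind {
  Projection, // forwards the reference (field address, cast); its users matter too
  Access,     // reads or writes through the reference without leaking it
  Escape      // stored to memory, passed to an unknown call, returned, ...
};

// Hypothetical stand-in for a SIL value and its users.
struct Value {
  UseKind kind;
  std::vector<const Value *> users;
};

// Returns true if no transitive use of `allocation` lets the reference escape.
// The allocation's own `kind` is ignored; only its users are inspected.
bool isUniqueReference(const Value &allocation) {
  std::vector<const Value *> worklist(allocation.users.begin(),
                                      allocation.users.end());
  std::unordered_set<const Value *> visited;
  while (!worklist.empty()) {
    const Value *use = worklist.back();
    worklist.pop_back();
    assert(use && "user list must not contain null entries");
    if (!visited.insert(use).second)
      continue; // already handled via another path
    switch (use->kind) {
    case UseKind::Escape:
      return false;
    case UseKind::Access:
      break;
    case UseKind::Projection:
      // The reference flows onward, so the users of this use are uses too.
      worklist.insert(worklist.end(), use->users.begin(), use->users.end());
      break;
    }
  }
  return true;
}
```

In this toy model a single query visits each transitive use once. The cost adds up when a query like this is repeated for every access inside an algorithm that is already expensive, which is the case for caching the answer in a flag on the alloc_ref.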

Another thing that is really neat about this implementation is that, because it doesn't rely on any analysis utils, it can run in the SILVerifier. This means it can run after every pass, which makes debugging extremely easy.
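To show why verifying after every pass helps with debugging, here is a toy model in C++. Everything in it is hypothetical (`ToyFunction`, `Pass`, `verifyUniqueFlag`, and the pass names are invented for illustration and are not the SILVerifier API). The point is only that checking the invariant between passes pins a broken flag on the pass that broke it.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical toy "function": one allocation with a cached unique flag,
// plus the fact that the flag is supposed to summarize.
struct ToyFunction {
  bool markedUnique = false;   // the cached flag
  bool hasEscapingUse = false; // what a full use walk would find
};

// Hypothetical pass: a name and a transformation.
struct Pass {
  std::string name;
  std::function<void(ToyFunction &)> run;
};

// The invariant: the flag may only be set when nothing escapes.
bool verifyUniqueFlag(const ToyFunction &fn) {
  return !(fn.markedUnique && fn.hasEscapingUse);
}

// Runs the passes in order and verifies after each one. Returns the name of
// the first pass that leaves the flag inconsistent, or "" if none does.
std::string firstPassBreakingFlag(ToyFunction &fn,
                                  const std::vector<Pass> &pipeline) {
  assert(verifyUniqueFlag(fn) && "input must already satisfy the invariant");
  for (const Pass &pass : pipeline) {
    pass.run(fn);
    if (!verifyUniqueFlag(fn))
      return pass.name;
  }
  return "";
}
```

With a pipeline of a pass that sets the flag, a buggy pass that introduces an escaping use without clearing it, and a no-op cleanup pass, the function returns the buggy pass's name instead of letting the failure surface somewhere downstream.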

The flag is low-maintenance.

Additionally, if we wanted to remove it for some reason, I don't think there would be much cost to simply "turning it off."

After thinking about this more, there's another place that could have huge performance benefits from this feature. As you mentioned, the CoW "problem" discussed in:

could potentially be resolved using this flag.

Yesterday and today I spent a bit of time playing around with the example in that post to try to figure out if I could use this flag to resolve the issue. I was actually able to get rid of all the copies with:

and manual inlining. But I suspect more complicated examples of the problem, ones that can't be optimized away, could be solved by using this flag to essentially say, "if the source argument is unique, don't make a copy here." We might need a way to expose this at the language level, though, and that may not be ideal, especially considering that ownership changes (move-only types, etc.) are (hopefully) just around the corner.
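The "if the source is unique, don't make a copy" decision can be sketched as a runtime check in C++. This is an analogy only: `CowBuffer` is a hypothetical type, and `std::shared_ptr`'s reference count stands in for the uniqueness information. It is not how Swift's runtime or the proposed flag works. The idea discussed above is that a compile-time "unique" flag could let the optimizer prove the no-copy branch statically.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// A toy copy-on-write integer buffer (single-threaded use only, since
// use_count() is not a reliable uniqueness test across threads).
class CowBuffer {
public:
  explicit CowBuffer(std::vector<int> values)
      : storage_(std::make_shared<std::vector<int>>(std::move(values))) {}

  void set(std::size_t index, int value) {
    assert(index < storage_->size() && "index out of range");
    if (storage_.use_count() > 1) {
      // Storage is shared: copy before writing so other owners are unaffected.
      storage_ = std::make_shared<std::vector<int>>(*storage_);
      ++copies_;
    }
    // Storage is uniquely owned here, so the write happens in place.
    (*storage_)[index] = value;
  }

  int get(std::size_t index) const {
    assert(index < storage_->size() && "index out of range");
    return (*storage_)[index];
  }

  // Number of storage copies this buffer has made on write.
  int copiesMade() const { return copies_; }

private:
  std::shared_ptr<std::vector<int>> storage_;
  int copies_ = 0;
};
```

Writing to a buffer that has never been shared makes no copy. Writing to a buffer right after copying it from another one makes exactly one copy and leaves the original untouched.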