Performance issues in automatic reference counting (ARC)?

Brian_Gesiak · December 17, 2016, 7:13pm

Hello all!

I really enjoyed Chris Lattner's slides from his talk at IBM <
http://researcher.watson.ibm.com/researcher/files/us-lmandel/lattner.pdf>\.

The speaker notes mention ARC:

"There are two principle downsides to ARC that people cite: one is the need
for atomic increment/decrements, which can be slow." [...] "The performance
problems it can cause are real in some important cases"

Can someone point me to a good resource that explains these problems? I
guess atomic reference count changes create overhead in multithreaded
applications? Are there more detailed explorations into this topic?

Thanks!

- Brian Gesiak

rudkx · December 17, 2016, 7:47pm

Hello all!

I really enjoyed Chris Lattner's slides from his talk at IBM <http://researcher.watson.ibm.com/researcher/files/us-lmandel/lattner.pdf>\.

The speaker notes mention ARC:

"There are two principle downsides to ARC that people cite: one is the need for atomic increment/decrements, which can be slow." [...] "The performance problems it can cause are real in some important cases"

Can someone point me to a good resource that explains these problems?

This might be a good starter post: Atomic operations and contention | The ryg blog

I guess atomic reference count changes create overhead in multithreaded applications?

There is overhead even in single-threaded applications, as the post above discusses. The overhead depends very much on the details of the particular microarchitecture.

Are there more detailed explorations into this topic?

If you’re looking for something Swift- or ARC-specific I don’t know of anything.

Mark

···

On Dec 17, 2016, at 11:13 AM, Brian Gesiak via swift-dev <swift-dev@swift.org> wrote:

Michael_Gottesman · December 17, 2016, 8:49pm

Hello all!

I really enjoyed Chris Lattner's slides from his talk at IBM <http://researcher.watson.ibm.com/researcher/files/us-lmandel/lattner.pdf>\.

The speaker notes mention ARC:

"There are two principle downsides to ARC that people cite: one is the need for atomic increment/decrements, which can be slow." [...] "The performance problems it can cause are real in some important cases"

Can someone point me to a good resource that explains these problems? I guess atomic reference count changes create overhead in multithreaded applications? Are there more detailed explorations into this topic?

With a proper concurrency model, I believe you can make most reference counting operations local (my opinion). I have done some explorations in this area in the past using what I call thread local vs global reference counts and using marked concurrency boundaries to mediate transitions in between them (moving from thread local -> atomic of course if one escapes in an undefined way).

If you are interested in the perf difference with ARC atomics, Roman recently added a mode to the compiler called -assume-single-threaded that uses non-atomic reference counts anywhere.

There are some interesting optimizations in this area as well, specifically even today, COW gives a nice guarantee of thread localness allowing you to eliminate atomic reference counts once you have a uniqued cow data structure.

Michael

···

On Dec 17, 2016, at 11:13 AM, Brian Gesiak via swift-dev <swift-dev@swift.org> wrote:

Thanks!

- Brian Gesiak

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

Brian_Gesiak · December 17, 2016, 8:10pm

Great, thanks! That article was exactly what I had in mind. :)

- Brian Gesiak

···

On Sat, Dec 17, 2016 at 2:47 PM, Mark Lacey <mark.lacey@apple.com> wrote:

On Dec 17, 2016, at 11:13 AM, Brian Gesiak via swift-dev < > swift-dev@swift.org> wrote:

Hello all!

I really enjoyed Chris Lattner's slides from his talk at IBM <
http://researcher.watson.ibm.com/researcher/files/us-lmandel/lattner.pdf
>.

The speaker notes mention ARC:

"There are two principle downsides to ARC that people cite: one is the
need for atomic increment/decrements, which can be slow." [...] "The
performance problems it can cause are real in some important cases"

Can someone point me to a good resource that explains these problems?

This might be a good starter post: https://fgiesen.
wordpress.com/2014/08/18/atomics-and-contention

I guess atomic reference count changes create overhead in multithreaded
applications?

There is overhead even in single-threaded applications, as the post above
discusses. The overhead depends very much on the details of the particular
microarchitecture.

Are there more detailed explorations into this topic?

If you’re looking for something Swift- or ARC-specific I don’t know of
anything.

Mark

Mikio_Takeuchi · January 31, 2017, 8:07am

Hi Michael,

If you are interested in the perf difference with ARC atomics, Roman

recently added a mode to the compiler called -assume-single-threaded that
uses non-atomic reference counts anywhere.

I think that is not exactly true. As of now, -assume-single-threaded
option can eliminate atomic reference counts only for reference types.

I tried -assume-single-threaded for compiling applications as well as swift
runtime, and found that atomic reference counts were still used for value
types and improvements were limited because of them.

SIL Instructions on value types (such as CopyValue) are not subtype of
RefCountingInst, therefore they don't have a mechanism to represent
atomicity. COW requires reference counts and, because of the lack of
information, atomic reference counts are assumed at many places in their
codegen and runtime.

I made a prototype which returns the right atomicity based on the compiler
option in order to eliminate atomic reference counts from generated code. I
also modified value witness functions to eliminate atomic reference counts
from them. With these changes, atomic reference counts almost disappeared.

If it makes sense, I am happy to contribute my changes to the community.

I understand there are two problems with my prototype.
1) We may need to introduce a mechanism to represent atomicity for value
types as well. It will open an opportunity for compiler to use non-atomic
reference counts for thread-local values.
2) We need to either extend value witness table to add non-atomic version
of functions, or pass atomicity information to the functions as an extra
parameter.

Since they are not trivial changes, I would like your endorsement before
starting serious work.

-- Mikio

···

2016-12-18 5:49 GMT+09:00 Michael Gottesman via swift-dev < swift-dev@swift.org>:

On Dec 17, 2016, at 11:13 AM, Brian Gesiak via swift-dev < > swift-dev@swift.org> wrote:

Hello all!

I really enjoyed Chris Lattner's slides from his talk at IBM <
http://researcher.watson.ibm.com/researcher/files/us-lmandel/lattner.pdf
>.

The speaker notes mention ARC:

"There are two principle downsides to ARC that people cite: one is the
need for atomic increment/decrements, which can be slow." [...] "The
performance problems it can cause are real in some important cases"

Can someone point me to a good resource that explains these problems? I
guess atomic reference count changes create overhead in multithreaded
applications? Are there more detailed explorations into this topic?

With a proper concurrency model, I believe you can make most reference
counting operations local (my opinion). I have done some explorations in
this area in the past using what I call thread local vs global reference
counts and using marked concurrency boundaries to mediate transitions in
between them (moving from thread local -> atomic of course if one escapes
in an undefined way).

If you are interested in the perf difference with ARC atomics, Roman
recently added a mode to the compiler called -assume-single-threaded that
uses non-atomic reference counts anywhere.

There are some interesting optimizations in this area as well,
specifically even today, COW gives a nice guarantee of thread localness
allowing you to eliminate atomic reference counts once you have a uniqued
cow data structure.

Michael

Thanks!

- Brian Gesiak

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

Roman_Levenstein · January 31, 2017, 5:02pm

Hi Mikio,

Hi Michael,

> If you are interested in the perf difference with ARC atomics, Roman recently added a mode to the compiler called -assume-single-threaded that uses non-atomic reference counts anywhere.

I think that is not exactly true. As of now, -assume-single-threaded option can eliminate atomic reference counts only for reference types.

I tried -assume-single-threaded for compiling applications as well as swift runtime, and found that atomic reference counts were still used for value types and improvements were limited because of them.

That’s correct. -assume-single-threaded does not cover all cases. I think it is even mentioned somewhere in the comments. It was meant as a quick hack to allow some experiments in this direction.

SIL Instructions on value types (such as CopyValue) are not subtype of RefCountingInst, therefore they don't have a mechanism to represent atomicity. COW requires reference counts and, because of the lack of information, atomic reference counts are assumed at many places in their codegen and runtime.

I made a prototype which returns the right atomicity based on the compiler option in order to eliminate atomic reference counts from generated code. I also modified value witness functions to eliminate atomic reference counts from them. With these changes, atomic reference counts almost disappeared.

Cool! Thanks for working on this!

Could you report some preliminary performance improvement due to these changes? It would be interesting to see the comparison with the vanilla Swift compiler, the existing -assume-single-threaded option and your improvements.

If it makes sense, I am happy to contribute my changes to the community.

I understand there are two problems with my prototype.
1) We may need to introduce a mechanism to represent atomicity for value types as well. It will open an opportunity for compiler to use non-atomic reference counts for thread-local values.
2) We need to either extend value witness table to add non-atomic version of functions, or pass atomicity information to the functions as an extra parameter.

Since they are not trivial changes, I would like your endorsement before starting serious work.

It would be interesting to see your code in any case. Could you share a link to the branch with your changes?

Regarding the question about the future work on the prototype, let’s first see the code and taken implementation approach. Then we could discuss its design. etc.

-Roman

···

On Jan 31, 2017, at 12:07 AM, Mikio Takeuchi via swift-dev <swift-dev@swift.org> wrote:

-- Mikio

2016-12-18 5:49 GMT+09:00 Michael Gottesman via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>>:

On Dec 17, 2016, at 11:13 AM, Brian Gesiak via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:

Hello all!

I really enjoyed Chris Lattner's slides from his talk at IBM <http://researcher.watson.ibm.com/researcher/files/us-lmandel/lattner.pdf>\.

The speaker notes mention ARC:

"There are two principle downsides to ARC that people cite: one is the need for atomic increment/decrements, which can be slow." [...] "The performance problems it can cause are real in some important cases"

Can someone point me to a good resource that explains these problems? I guess atomic reference count changes create overhead in multithreaded applications? Are there more detailed explorations into this topic?

With a proper concurrency model, I believe you can make most reference counting operations local (my opinion). I have done some explorations in this area in the past using what I call thread local vs global reference counts and using marked concurrency boundaries to mediate transitions in between them (moving from thread local -> atomic of course if one escapes in an undefined way).

If you are interested in the perf difference with ARC atomics, Roman recently added a mode to the compiler called -assume-single-threaded that uses non-atomic reference counts anywhere.

There are some interesting optimizations in this area as well, specifically even today, COW gives a nice guarantee of thread localness allowing you to eliminate atomic reference counts once you have a uniqued cow data structure.

Michael

Thanks!

- Brian Gesiak

_______________________________________________
swift-dev mailing list
swift-dev@swift.org <mailto:swift-dev@swift.org>
https://lists.swift.org/mailman/listinfo/swift-dev

_______________________________________________
swift-dev mailing list
swift-dev@swift.org <mailto:swift-dev@swift.org>
https://lists.swift.org/mailman/listinfo/swift-dev

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

Michael_Gottesman · January 31, 2017, 6:32pm

Hi Michael,

> If you are interested in the perf difference with ARC atomics, Roman recently added a mode to the compiler called -assume-single-threaded that uses non-atomic reference counts anywhere.

I think that is not exactly true. As of now, -assume-single-threaded option can eliminate atomic reference counts only for reference types.

I tried -assume-single-threaded for compiling applications as well as swift runtime, and found that atomic reference counts were still used for value types and improvements were limited because of them.

SIL Instructions on value types (such as CopyValue) are not subtype of RefCountingInst, therefore they don't have a mechanism to represent atomicity.

This is not an issue since currently copy value is lowered right after SILGen to instructions that /do/ have atomicity. My understanding is that the assume single threaded option runs a pass late to set these all to non-atomic.

COW requires reference counts and, because of the lack of information, atomic reference counts are assumed at many places in their codegen and runtime.

IMO again, this is not an issue due to the late pass.

I made a prototype which returns the right atomicity based on the compiler option in order to eliminate atomic reference counts from generated code. I also modified value witness functions to eliminate atomic reference counts from them. With these changes, atomic reference counts almost disappeared.

The value witness functions I think are the larger potential issue.

If it makes sense, I am happy to contribute my changes to the community.

I understand there are two problems with my prototype.
1) We may need to introduce a mechanism to represent atomicity for value types as well. It will open an opportunity for compiler to use non-atomic reference counts for thread-local values.

Again, I do not think that is an issue since we lower these away today. This may require changes at a later time though once these value operations go further back into the compiler.

2) We need to either extend value witness table to add non-atomic version of functions, or pass atomicity information to the functions as an extra parameter.

Since your option makes an assumption that the whole program is single threaded, why couldn't we just emit the value witness functions such that they use non-atomic reference counts?

Since they are not trivial changes, I would like your endorsement before starting serious work.

Send a PR and lets talk about it. [i.e. what Roman said ; )]

···

On Jan 31, 2017, at 12:07 AM, Mikio Takeuchi <mikio.takeuchi@gmail.com> wrote:

-- Mikio

2016-12-18 5:49 GMT+09:00 Michael Gottesman via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>>:

On Dec 17, 2016, at 11:13 AM, Brian Gesiak via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:

Hello all!

I really enjoyed Chris Lattner's slides from his talk at IBM <http://researcher.watson.ibm.com/researcher/files/us-lmandel/lattner.pdf>\.

The speaker notes mention ARC:

"There are two principle downsides to ARC that people cite: one is the need for atomic increment/decrements, which can be slow." [...] "The performance problems it can cause are real in some important cases"

Can someone point me to a good resource that explains these problems? I guess atomic reference count changes create overhead in multithreaded applications? Are there more detailed explorations into this topic?

With a proper concurrency model, I believe you can make most reference counting operations local (my opinion). I have done some explorations in this area in the past using what I call thread local vs global reference counts and using marked concurrency boundaries to mediate transitions in between them (moving from thread local -> atomic of course if one escapes in an undefined way).

If you are interested in the perf difference with ARC atomics, Roman recently added a mode to the compiler called -assume-single-threaded that uses non-atomic reference counts anywhere.

There are some interesting optimizations in this area as well, specifically even today, COW gives a nice guarantee of thread localness allowing you to eliminate atomic reference counts once you have a uniqued cow data structure.

Michael

Thanks!

- Brian Gesiak

_______________________________________________
swift-dev mailing list
swift-dev@swift.org <mailto:swift-dev@swift.org>
https://lists.swift.org/mailman/listinfo/swift-dev

_______________________________________________
swift-dev mailing list
swift-dev@swift.org <mailto:swift-dev@swift.org>
https://lists.swift.org/mailman/listinfo/swift-dev

Mikio_Takeuchi · February 1, 2017, 10:45am

Hi Roman and Michael,

Thanks for your interest ! In order not to bother anyone else with a giant
mail, I will send you my preliminary results in a separate mail.

My prototype uses slightly old code base (last December), so some problems
may have already resolved. It already has the second run of the
AssumeSingleThreaded path at the very end of transformation, however I have
not verified it was effective.

I will share my prototype once I get an approval from my employer.

Thanks,
-- Mikio

···

2017-02-01 2:02 GMT+09:00 Roman Levenstein <rlevenstein@apple.com>:

Hi Mikio,

On Jan 31, 2017, at 12:07 AM, Mikio Takeuchi via swift-dev < > swift-dev@swift.org> wrote:

Hi Michael,

> If you are interested in the perf difference with ARC atomics, Roman
recently added a mode to the compiler called -assume-single-threaded that
uses non-atomic reference counts anywhere.

I think that is not exactly true. As of now, -assume-single-threaded
option can eliminate atomic reference counts only for reference types.

I tried -assume-single-threaded for compiling applications as well as
swift runtime, and found that atomic reference counts were still used for
value types and improvements were limited because of them.

That’s correct. -assume-single-threaded does not cover all cases. I
think it is even mentioned somewhere in the comments. It was meant as a
quick hack to allow some experiments in this direction.

SIL Instructions on value types (such as CopyValue) are not subtype of
RefCountingInst, therefore they don't have a mechanism to represent
atomicity. COW requires reference counts and, because of the lack of
information, atomic reference counts are assumed at many places in their
codegen and runtime.

I made a prototype which returns the right atomicity based on the compiler
option in order to eliminate atomic reference counts from generated code. I
also modified value witness functions to eliminate atomic reference counts
from them. With these changes, atomic reference counts almost disappeared.

Cool! Thanks for working on this!

Could you report some preliminary performance improvement due to these
changes? It would be interesting to see the comparison with the vanilla
Swift compiler, the existing -assume-single-threaded option and your
improvements.

If it makes sense, I am happy to contribute my changes to the community.

I understand there are two problems with my prototype.
1) We may need to introduce a mechanism to represent atomicity for value
types as well. It will open an opportunity for compiler to use non-atomic
reference counts for thread-local values.
2) We need to either extend value witness table to add non-atomic version
of functions, or pass atomicity information to the functions as an extra
parameter.

Since they are not trivial changes, I would like your endorsement before
starting serious work.

It would be interesting to see your code in any case. Could you share a
link to the branch with your changes?

Regarding the question about the future work on the prototype, let’s first
see the code and taken implementation approach. Then we could discuss its
design. etc.

-Roman

-- Mikio

2016-12-18 5:49 GMT+09:00 Michael Gottesman via swift-dev <
swift-dev@swift.org>:

On Dec 17, 2016, at 11:13 AM, Brian Gesiak via swift-dev < >> swift-dev@swift.org> wrote:

Hello all!

I really enjoyed Chris Lattner's slides from his talk at IBM <
http://researcher.watson.ibm.com/researcher/files/us-lmandel/lattner.pdf
>.

The speaker notes mention ARC:

"There are two principle downsides to ARC that people cite: one is the
need for atomic increment/decrements, which can be slow." [...] "The
performance problems it can cause are real in some important cases"

Can someone point me to a good resource that explains these problems? I
guess atomic reference count changes create overhead in multithreaded
applications? Are there more detailed explorations into this topic?

With a proper concurrency model, I believe you can make most reference
counting operations local (my opinion). I have done some explorations in
this area in the past using what I call thread local vs global reference
counts and using marked concurrency boundaries to mediate transitions in
between them (moving from thread local -> atomic of course if one escapes
in an undefined way).

If you are interested in the perf difference with ARC atomics, Roman
recently added a mode to the compiler called -assume-single-threaded that
uses non-atomic reference counts anywhere.

There are some interesting optimizations in this area as well,
specifically even today, COW gives a nice guarantee of thread localness
allowing you to eliminate atomic reference counts once you have a uniqued
cow data structure.

Michael

Thanks!

- Brian Gesiak

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev