Performance Refinement of Data


(Philippe Hausler) #1

Performance Refinement of Data
Introduction
In Swift 3 the Foundation team introduced a new structural type, Data, to represent NSData as exposed into Swift.

Data allows developers to interact with binary data with value semantics, which is often more appropriate than using pointers like UnsafeMutablePointer. Having an encapsulating type to abstract the common operations is often quite advantageous for tasks like parsing, where types like String may not be appropriate or may have hidden performance "gotchas".

Data can easily be a critical point of performance for many tasks. Data is an appropriate common currency for transacting a safe, managed buffer of bytes that interoperates well with existing Cocoa APIs. This means it should be tuned for performance as much as possible.

Motivation
There are several outstanding performance improvements which can be made to Data; issues regarding Data's performance have been raised with the Foundation team both publicly and privately, and we would like to address those.

Data should be as fast as possible. Currently, most of the backing of Data is implemented in Foundation; while that implementation is quite fast for reallocations and other operations, it means that calls cross between Swift and Objective-C on every access, even for simple things like count.

This Swift–Objective-C boundary means that no inlining can be performed across it; even when we have full control over the backing implementation and the caller, what would normally be just a single offset load instruction ends up becoming many instructions, just for the benefit of an objc_msgSend (not to mention ARC operations). Even though the calls are heavily optimized, they will never be as fast as a single instruction.

Proposed solution
In order to make Data as fast as possible the implementation needs to be inlined, since inlining is one of the best performance optimizations Swift can offer. Doing so requires a re-think of how Data and NSData interact: the structure Data will need to adopt certain attributes that allow it to be inlined into call sites. Not only is function dispatch overhead reduced, but optimization passes such as cold-path placement and branch elimination become possible, letting the compiler do the best possible thing for each call site.

Data will adopt the @inline(__always) annotation in key locations and use a non-Objective-C backing class to store the pointer and length (and other internal ivars). That backing object allows for a reduction of capture points and avoids the extra retains/releases that were added for mutation detection.

Instead of using _MutablePairBoxing (which uses closures to map invocations to references or to apply mutations with copy-on-write semantics), the new backing implementation can apply copy-on-write semantics without any Unmanaged "dancing". That "dancing" complicates the code, and it can be made considerably simpler. Furthermore, avoiding it can reduce the retain and release calls down to zero when applying mutations in the uniquely referenced case, as well as when mapping non-mutating invocations to the backing storage.
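The copy-on-write pattern described here can be sketched in a few lines. This is an illustrative model only — the names _BackingStorage and MiniData are hypothetical, not Foundation's actual implementation: a uniquely referenced backing is mutated in place, and a shared backing is copied first, with no Unmanaged round-trips.

```swift
import Foundation

// Hypothetical backing class (illustrative, not Foundation's real one):
// owns a raw buffer and can duplicate itself for copy-on-write.
final class _BackingStorage {
    var bytes: UnsafeMutableRawPointer
    var length: Int

    init(capacity: Int) {
        bytes = UnsafeMutableRawPointer.allocate(byteCount: Swift.max(capacity, 1), alignment: 1)
        length = capacity
    }

    func copy() -> _BackingStorage {
        let new = _BackingStorage(capacity: length)
        new.bytes.copyMemory(from: bytes, byteCount: length)
        return new
    }

    deinit { bytes.deallocate() }
}

struct MiniData {
    private var storage: _BackingStorage

    init(count: Int) {
        storage = _BackingStorage(capacity: count)
        storage.bytes.initializeMemory(as: UInt8.self, repeating: 0, count: count)
    }

    var count: Int { return storage.length }

    subscript(index: Int) -> UInt8 {
        get {
            return storage.bytes.load(fromByteOffset: index, as: UInt8.self)
        }
        set {
            // Copy only when the backing is shared; a uniquely referenced
            // backing is mutated in place, with no extra retain/release traffic.
            if !isKnownUniquelyReferenced(&storage) {
                storage = storage.copy()
            }
            storage.bytes.storeBytes(of: newValue, toByteOffset: index, as: UInt8.self)
        }
    }
}
```

A mutation through one copy then leaves other copies untouched, which is the observable value semantics the real Data provides.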

Subclassing the reference type NSData is still something that Foundation should support for the wrapping of the reference type in a structure. This effectively means there are five types of backing for Data: Swift-implemented, immutable NSData, mutable NSMutableData, custom subclasses of NSData, and custom subclasses of NSMutableData. These specific cases are delineated to offer the most efficient inline cases possible.

Since Foundation can enforce a no-dynamic-dispatch contract with itself for the standard class-cluster members of NSData and NSMutableData, it can guarantee that in these cases dynamic dispatch is not needed every time bytes or length is accessed; the values can be cached until the data is mutated or disposed of. Where a subclass is used, of course, all bets are off and every access requires a dynamic callout.

In short, this means that fetching the count of a Data can be optimized down to a single branch plus a load from an offset, and the same optimization can be applied to many other methods on Data.
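The branch-plus-load shape can be sketched as follows. The names _Backing and InlineData are assumptions for illustration (the real implementation distinguishes five backing flavors; two are shown here): the pointer and length are cached once in the backing, so an @inline(__always) count reduces to a switch and a payload load rather than an objc_msgSend per access.

```swift
import Foundation

// Illustrative backing (hypothetical names): either a Swift-allocated
// buffer, or an NSData reference whose bytes/length were cached at
// bridge time so they never require dynamic dispatch again.
enum _Backing {
    case swift(UnsafeMutableRawPointer, Int)      // Swift-allocated buffer
    case reference(NSData, UnsafeRawPointer, Int) // NSData + cached bytes/length
}

struct InlineData {
    var backing: _Backing

    @inline(__always)
    var count: Int {
        switch backing {
        case .swift(_, let length):          // single branch + offset load
            return length
        case .reference(_, _, let length):   // cached when the bridge occurred
            return length
        }
    }
}
```

With the getter inlined into the call site, the optimizer sees the branch and the load directly and can fold or hoist them.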

Bridging to and from Objective-C
Many of the sources that Data is derived from are SDKs written in Objective-C. For other types like Array, Set, Dictionary, or String the objects returned are often not very large: arrays may have a handful of objects, strings may only be a few hundred characters, and so on. In these small cases it makes sense to "eagerly" bridge those reference types into a more inlinable version (there are exclusions, but this is most often the case).

Data does not follow this pattern. Often it is sourced from files on disk (which could be exceedingly large) or from the results of network transactions or downloads. These cases would definitely suffer from an "eager" O(n) bridge, due not only to the duplicated memory allocations needed to hold both backing stores but also to the byte copy from the reference type to the value type. Data should be fast no matter where it came from, unless its dynamic dispatch requirements are truly unknown.

To build a Data that is fast for inline optimizations, the bytes pointer and length need to be cached for the duration of the object. When as is used to bridge a custom reference type to Data, dynamic dispatch must occur on every call to count and every time bytes is touched; but if the Data is known to come from a source whose dynamic dispatch expectations we control, that dynamic dispatch can be elided and the behavior preserved by mimicking the Objective-C implementation in Swift.

Bridging in the other direction also offers some distinct performance optimizations that can be taken advantage of.

When the lifespan of the callout to Objective-C is well known, a Swift-constructed Data can easily pass an NSData without copying the backing buffer. It is the responsibility of the Objective-C APIs either to not directly retain the object past the scope of the call or to copy it when a long-lasting reference is needed. Any Objective-C method or function that takes an NSData and simply retains it or unsafely stores it past the function callout is likely incorrect and has bugs, no matter the language it was invoked from. In this case, where the Data was created in Swift, the bridge only needs to allocate the wrapper NSData; no O(n) copy needs to occur (unless the callee holds a reference, as previously stated).
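The no-copy direction can be demonstrated with NSData's existing bytesNoCopy initializer. A small sketch: freeWhenDone: false keeps ownership of the buffer with the caller, so the NSData must not outlive the buffer — exactly the "well known lifespan" contract described above.

```swift
import Foundation

// Wrap a Swift-owned buffer in an NSData with no O(n) byte copy.
let count = 1024
let buffer = UnsafeMutableRawPointer.allocate(byteCount: count, alignment: 1)
buffer.initializeMemory(as: UInt8.self, repeating: 0xAB, count: count)

// freeWhenDone: false — the caller retains ownership; the NSData is only
// valid for as long as `buffer` stays alive.
let wrapped = NSData(bytesNoCopy: buffer, length: count, freeWhenDone: false)

// The wrapper points at the original storage; no copy occurred.
assert(wrapped.bytes == UnsafeRawPointer(buffer))
assert(wrapped.length == count)

buffer.deallocate()
```

An Objective-C API that needs the bytes past the call would have to copy, which is the correct behavior regardless of which language created the NSData.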

The final case of bridging is when a Data is obtained from Objective-C and then passed directly back to Objective-C. The compiler can apply "peephole" optimizations in direct callout cases such as returnsAData() as NSData, but these are only really enforceable in limited scope (sometimes just the same line of code). Since the backing store can hold a reference to the reference type, the bridge method (when the data has not been mutated) can pass that reference back over to Objective-C. For mutated versions, a copy of the mutated data can be passed along instead (taking advantage of any optimizations that dynamic dispatch affords for calls to copy).

Detailed performance breakdown
Each graph below is a comparison between the Swift 3 Data and the new version of Data for each of the inline cases. The horizontal axis in each graph represents N and the vertical axis represents the sampled duration in nanoseconds. Each data set in the plots is an average over 100 samples (unless otherwise specified) per value of N. The attached graphs were generated from optimized builds on a Mac Pro (Late 2013) 3.5 GHz 6-Core Intel Xeon E5 with 16 GB 1866 MHz DDR3.

func createSampleData(ofLength N: Int) -> Data {
    var buffer = [UInt8](repeating: 0, count: N)
    return buffer.withUnsafeMutableBytes { (buffer: UnsafeMutableRawBufferPointer) -> Data in
        arc4random_buf(buffer.baseAddress!, N)
        return Data(bytes: buffer.baseAddress!, count: N)
    }
}

func createSampleDataReference(ofLength N: Int) -> NSData {
    var buffer = [UInt8](repeating: 0, count: N)
    return buffer.withUnsafeMutableBytes { (buffer: UnsafeMutableRawBufferPointer) -> NSData in
        arc4random_buf(buffer.baseAddress!, N)
        return NSData(bytes: buffer.baseAddress!, length: N)
    }
}

func createSampleArray(ofLength N: Int) -> [UInt8] {
    var buffer = [UInt8](repeating: 0, count: N)
    buffer.withUnsafeMutableBytes { (buffer: UnsafeMutableRawBufferPointer) -> Void in
        arc4random_buf(buffer.baseAddress!, N)
    }
    return buffer
}
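One plausible harness for the "start measuring / end measuring" pairs below — the post does not show its actual benchmark driver, so the name measure and its shape are assumptions: run the body a number of times and report the mean duration in nanoseconds.

```swift
import Foundation
import Dispatch

// Hypothetical benchmark driver: runs `body` `iterations` times and
// returns the mean wall-clock duration in nanoseconds.
func measure(iterations: Int, _ body: () -> Void) -> Double {
    var total: UInt64 = 0
    for _ in 0..<iterations {
        let start = DispatchTime.now().uptimeNanoseconds
        body()
        total += DispatchTime.now().uptimeNanoseconds - start
    }
    return Double(total) / Double(iterations)
}

// Example: sampling `count` access, as in the first benchmark below.
let data = Data(count: 1024)
let nanos = measure(iterations: 1_000) {
    _ = data.count
}
print("average count access: \(nanos) ns")
```

Each snippet below would slot into the closure, with the setup lines run once beforehand.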
Accessing count

This should be an O(1) operation. The y axis is measured in nanoseconds, sampled over 100,000 iterations.

// setup
let data = createSampleData(ofLength: N)
// start measuring
_ = data.count
// end measuring

Subscripting

This should be an O(1) operation. The y axis is measured in nanoseconds, sampled over 100,000 iterations.

// setup
let data = createSampleData(ofLength: N)
// start measuring
_ = data[index]
// end measuring

// setup
var data = createSampleData(ofLength: N)
// start measuring
data[index] = 0x00
// end measuring

Appending

This should be an O(N) operation.

// setup
let dataToAppend = createSampleData(ofLength: N)
var data = Data()
// start measuring
data.append(dataToAppend)
// end measuring

// setup
let arrayToAppend = createSampleArray(ofLength: N)
var data = Data()
// start measuring
data.append(contentsOf: arrayToAppend)
// end measuring

The new version is still O(N), just with a much smaller constant factor.

var data = Data()
// start measuring
for _ in 0..<N {
    data.append(contentsOf: [0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA])
}
//end measuring

Replacing sub ranges

// setup
var data = createSampleData(ofLength: N)
// start measuring
data.replaceSubrange(0..<N, with: replacement)
// end measuring

// setup
var data = createSampleData(ofLength: N)
// start measuring
data.replaceSubrange(0..<min(N, 5), with: replacement)
// end measuring

Growth of count

// setup
var data = Data()
// start measuring
data.count = N
// end measuring

// setup
var data = Data()
data.count = starting
// start measuring
data.count = N
// end measuring

Bridging to reference types

This should be an O(1) operation. In the bridging-to-a-reference case the previous implementation was a bit faster. The only real extra overhead here is the allocation of the NSData object, since the Swift-backed Data has no existing reference type to pass along. There are a few extra optimizations that can be done in this path to reduce the approximately 150 nanosecond difference. In practice, the cases where Data is bridged back out to Objective-C are usually things like writing to a file or socket, which dwarf that 150 nanosecond differential.

// setup
let data = createSampleData(ofLength: N)
// start measuring
_ = data as NSData
// end measuring

Bridging from reference types

This should be an O(1) operation.

// setup
let data = createSampleDataReference(ofLength: N)
// start measuring
_ = data as Data
// end measuring


Performance issues with Foundation.Data
(Alex Blewitt) #2

This might be of interest to the swift-server-dev community who aren't on either of the other lists.

Alex

···

Begin forwarded message:

From: Philippe Hausler via swift-dev <swift-dev@swift.org>
Subject: [swift-dev] Performance Refinement of Data
Date: 19 November 2016 at 19:10:32 GMT
To: swift-dev@swift.org, swift-corelibs-dev@swift.org
Reply-To: Philippe Hausler <phausler@apple.com>



(Helge Heß) #3

This might be of interest to the swift-server-dev community who aren't on either of the other lists.

Thanks.

Hm, I don’t know. Something I don’t quite get is why this is even required. What is really different between that `Data` and a `[ UInt8 ]`? Just the bridging to NSData (which we don’t have on Linux), and couldn’t all that bridging stuff described be done against a special [UInt8] array-backend class too?

The performance patterns attached are focused around a single data object. Also: "Data does not follow this pattern. Often it is sourced from files on disk (which could be exceedingly large)”.
For the server I think the focus needs to be quite different.

So what are the requirements for a ‘byte bucket’ data structure on the server. Some stuff I can think of:

- There are a lot (“millions") of read and write calls with
  tons of small byte buffers (essentially socket buffer
  sizes, but when framing has been processed even smaller ones ...)

- Copying of those buffers should be avoided as much as possible
  - Ideally it should be possible to reuse such buffers (instead
    of malloc/free’ing them at a high rate)

- It is rather rare that the buffers in servers are themselves
  modified; usually you create a new, transformed buffer (or
  other object, like a String or Int) out of existing ones
  - except for one operation: you often slice them (e.g. when
    encountering an HTTP chunk marker, or some other framing in
    a protocol or data format).
  - that might imply that the CoW overhead - even if small -
    may be unnecessarily high
  - instead there needs to be a focus on dealing with ranges

- It should be highly efficient to create Strings from such
  byte buckets (arrays). While creating them should be delayed
  to the latest point possible, it quite often will be
  the API the top-level framework user is expecting (JSON,
  query-strings, XML, whatever).

- There needs to be something to represent arrays of those
  buckets, Apache calls them brigades. One could use Arrays
  for such, but maybe they should be lists instead.
  (this is required to avoid copying split buffers into
   larger ones, and is supported by writev/readv)

- Except for some special cases like caches - unlike mentioned
  in the proposal - server apps should rarely ‘source byte
  buckets from disks’.
  Instead of loading the buckets into memory, maybe there
  should be a way to have byte-buckets which are files.
  NSData could do mmap’ed files, what you more often really
  want in the server context is `sendfile()` though.
  - some kind of a ‘FileDescriptorBackedBuffer’ protocol

- Scanning of buffers should be really fast - and by that I
  mean C `index()` like fast, not `for byte in bytes {}` array
  fast. But that is desirable for [UInt8] too :slight_smile:

There is probably more, but I think the stuff above are things that should be addressed :slight_smile:
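The slicing point above can be made concrete with today's types. This sketch uses ArraySlice (Data slices behave along similar lines): slicing shares storage with the base rather than copying, but the slice also keeps the base's indices — an easy server-side pitfall when a framing parser hands slices to code that assumes zero-based indices.

```swift
// A parser skips a 2-byte frame header and slices out the payload.
let packet: [UInt8] = [0x00, 0x01, 0xCA, 0xFE, 0xBA, 0xBE]
let payload = packet[2...]          // no byte copy; storage is shared

// The slice preserves the base's indices rather than rebasing to 0.
assert(payload.startIndex == 2)
assert(payload.first == 0xCA)

// An explicit O(n) copy rebases the indices when zero-based access is needed.
let rebased = Array(payload)
assert(rebased.startIndex == 0)
assert(rebased[0] == 0xCA)
```

Whether a server-oriented byte bucket should rebase slices or keep shared ranges is exactly the design tension described above.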

It may also be worth considering whether that ‘buffer’ data structure should only be able to carry bytes, or other payloads as well (along the lines of http://noze.io/noze4nonnode/#streams). But maybe that is too ambitious for v1 :slight_smile:

hh

P.S.: This may be good reading material: http://www.apachetutor.org/dev/brigades

···

On 28 Nov 2016, at 20:04, Alex Blewitt via swift-server-dev <swift-server-dev@swift.org> wrote:

Alex

Begin forwarded message:

From: Philippe Hausler via swift-dev <swift-dev@swift.org>
Subject: [swift-dev] Performance Refinement of Data
Date: 19 November 2016 at 19:10:32 GMT
To: swift-dev@swift.org, swift-corelibs-dev@swift.org
Reply-To: Philippe Hausler <phausler@apple.com>

Performance Refinement of Data
Introduction
In Swift 3 the Foundation team introduced a new structural type to represent NSData exposed into Swift.

Data allows developers to interact with binary data with value semantics, which is often more appropriate than using pointers like UnsafeMutablePointer. Having an encapsulating type to abstract the common operations is often quite advantageous to tasks like parsing, where types like String may not be appropriate for or have hidden performance "gotchas".

Data can easily be a critical point of performance for many tasks. Data is an appropriate common currency of transacting a safe managed buffer of bytes that interoperates well with existing Cocoa APIs. This means that it should be tuned for performance as much as possible.

Motivation
There are several outstanding performance improvements which can be made to Data; issues regarding Data's performance have been raised with the Foundation team both publicly and privately, and we would like to address those.

Data should be as fast as possible. Currently, most of the backing of Data is implemented in Foundation, while being quite fast for reallocations and other operations, this means that calls are made between Swift and Objective-C even for simple things like count for every access.

This Swift–Objective-C boundary means that no inlining can be performed across it; even when we have full control over the backing implementation and the caller, what would normally be just a single offset load instructions ends up becoming many just for the benefit of an objc_msgSend (not to mention ARC operations). Even though the calls are heavily optimized, they will never be as fast as a single instruction.

Proposed solution
In order to make Data as fast as possible the implementation needs to be inlined; since that is one of the best performance optimizations that Swift can offer. To do this it requires a re-think of how Data and NSData interact. This means that the structure Data will need to adopt certain attributes that will allow it to be inlined into the call-sites. The function dispatch overhead will be reduced but also optimization passes like cold paths and branch elimination can be possibilities for the compiler to do the best possible thing for the call site.

Data will adopt the annotation @inline(__always) in key locations and use a non-Objective-C backing class to store the pointer and length (and other internal ivars). That backing object will allow for a reduction of capture points as well as avoid extra retain/releases that were added for mutation detection.

Instead of using _MutablePairBoxing (which uses closures to map invocations to references or apply mutations with copy on write semantics) the new backing implementation can easily be applied with copy on write semantics without any Unmanaged "dancing". That "dancing" complicates code and it can be made considerably simpler. Furthermore, avoiding this "dance" can reduce the calls to retain and release down to zero in the application of mutations in unique referenced cases as well as mapping non mutating invocations to backing storage.

Subclassing the reference type NSData is still something that Foundation should support for the wrapping of the reference type in a structure. This effectively means there are five types of backing for Data: Swift-implemented, immutable NSData, mutable NSMutableData, custom subclasses of NSData, and custom subclasses of NSMutableData. These specific cases are delineated to offer the most efficient inline cases possible.

Since Foundation can enforce a no dynamic dispatch needed contract with itself in the cases of the standard class cluster members of NSData and NSMutableData Foundation can assure these cases are acceptable to not need dynamic dispatch for every time bytes or length are accessed and the values can be cached until the data is mutated or disposed of. In the cases where a subclass is used of course all bets are off and every point requires dynamically calling out.

In short this will mean that fetching the count of a Data can be optimized to a single branch and load from an offset and this same optimization can be applied to many other methods on Data.

Bridging to and from Objective-C
Many of the sources that Data is derived from are sourced from the SDKs written in Objective-C. For other types like Array,Set, Dictionary, or String the objects returned are many times not very large. Arrays may have a handful of objects, strings may only be a few hundred characters and so on. In these small cases it makes sense to "eagerly" bridge those reference types into a more inline-able version (there are exclusions to this but in general it is most often the case).

Data does not follow this pattern. Often it is sourced from files on disk (which could be exceedingly large) or results from network transactions of downloads. These cases would definitely suffer from having an "eager" O(n) bridge; due to not only memory allocation duplications to hold both backing stores but also to the byte copy from the reference type to the value type. Data should be fast no matter where it came from unless it is truly unknown on it's dynamic dispatch requirements.

To build a Data that is fast for inline optimizations the bytes pointer and length need to be cached for the duration of the object. When as is used to bridge a custom reference to Data dynamic dispatch must occur on every call to count and every time bytes are touched but if the Data is known to be obtained from a source that we can control the dynamic dispatch expectations that dynamic dispatch can be elided and behavior can be preserved by mimicking the Objective-C implementation in Swift.

Bridging in the other direction also offers some distinct performance optimizations.

When the lifespan of the callout to Objective-C is well known, a Swift-constructed Data can easily pass an NSData that wraps the backing buffer without copying. It is the responsibility of the Objective-C API either to not retain the object past the scope of the call or to copy it when a long-lasting reference is needed. Any Objective-C method or function that takes an NSData and retains or unsafely stores it past the call is likely incorrect and has bugs no matter the language it was invoked in. In this case, where the Data is created in Swift, bridging only needs to allocate the wrapper NSData; no O(n) copy needs to occur (unless the callee holds a reference, as previously stated).
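A sketch of that no-copy direction (illustrative only; consume below is a hypothetical stand-in for a well-behaved Objective-C API that does not retain the NSData past the call):

```swift
import Foundation

// A hypothetical well-behaved consumer: it uses the bytes only within
// the call; a long-lived consumer would have to copy instead.
func consume(_ reference: NSData) -> Int {
    return reference.length
}

var payload = Data(bytes: [0xDE, 0xAD, 0xBE, 0xEF])
let length = payload.count
let observed = payload.withUnsafeMutableBytes { (ptr: UnsafeMutablePointer<UInt8>) -> Int in
    // freeWhenDone: false — the NSData borrows the buffer for the scope
    // of this closure; no O(n) byte copy occurs.
    consume(NSData(bytesNoCopy: ptr, length: length, freeWhenDone: false))
}
// observed == 4
```

Only the NSData wrapper itself is allocated; the backing bytes are shared for the duration of the call.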

The final case of bridging is when a Data is obtained from Objective-C and then passed directly back to Objective-C. The compiler can sometimes optimize direct callouts such as returnsAData() as NSData with "peephole" optimizations, but these are only enforceable in a limited scope (sometimes just the same line of code). Since the backing store can hold a reference to the reference type, the bridge method (when the value has not been mutated) can pass that reference back over to Objective-C. For mutated values, a copy of the mutated version can be passed along instead (taking advantage of any optimizations that dynamic dispatch affords for calls to copy).

Detailed performance breakdown
Each graph below compares the Swift 3 Data with the new version of Data for each of the inline cases. The horizontal axis in each graph represents N, and the vertical axis represents the sampled duration in nanoseconds. Each data point is an average over 100 samples (unless otherwise specified) per value of N. The attached graphs were generated from optimized builds on a Mac Pro (Late 2013), 3.5 GHz 6-core Intel Xeon E5 with 16 GB 1866 MHz DDR3.

func createSampleData(ofLength N: Int) -> Data {
    var buffer = [UInt8](repeating: 0, count: N)
    return buffer.withUnsafeMutableBytes { (buffer: UnsafeMutableRawBufferPointer) -> Data in
        arc4random_buf(buffer.baseAddress!, N)
        return Data(bytes: buffer.baseAddress!, count: N)
    }
}
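The measurement harness itself was not included in the post; a minimal sketch of what such a loop could look like (the measure helper below is hypothetical, not part of the actual benchmarks):

```swift
import Dispatch
import Foundation

// Hypothetical timing helper: returns the mean nanoseconds per call of
// `body` over the given number of iterations.
func measure(iterations: Int = 100_000, _ body: () -> Void) -> Double {
    let start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<iterations { body() }
    let end = DispatchTime.now().uptimeNanoseconds
    return Double(end - start) / Double(iterations)
}

// Example: timing `count` on a zero-filled Data (the real runs used
// createSampleData(ofLength:) as defined above).
let data = Data(count: 1024)
let nsPerCall = measure { _ = data.count }
```

Each snippet below would be placed between the "start measuring" and "end measuring" markers of such a loop.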

func createSampleDataReference(ofLength N: Int) -> NSData {
    var buffer = [UInt8](repeating: 0, count: N)
    return buffer.withUnsafeMutableBytes { (buffer: UnsafeMutableRawBufferPointer) -> NSData in
        arc4random_buf(buffer.baseAddress!, N)
        return NSData(bytes: buffer.baseAddress!, length: N)
    }
}

func createSampleArray(ofLength N: Int) -> [UInt8] {
    var buffer = [UInt8](repeating: 0, count: N)
    buffer.withUnsafeMutableBytes { (buffer: UnsafeMutableRawBufferPointer) -> Void in
        arc4random_buf(buffer.baseAddress!, N)
    }
    return buffer
}

Accessing count

This should be an O(1) operation. The y axis is measured in nanoseconds, sampled over 100000 iterations.

// setup
let data = createSampleData(ofLength: N)
// start measuring
_ = data.count
// end measuring

<access_count.png>

Subscripting

This should be an O(1) operation. The y axis is measured in nanoseconds, sampled over 100000 iterations.

// setup
let data = createSampleData(ofLength: N)
// start measuring
_ = data[index]
// end measuring

<getting_subscript.png>

// setup
var data = createSampleData(ofLength: N)
// start measuring
data[index] = 0x00
// end measuring

<setting_subscript.png>

Appending

This should be an O(N) operation.

// setup
let dataToAppend = createSampleData(ofLength: N)
var data = Data()
// start measuring
data.append(dataToAppend)
// end measuring

<append_n_bytes_with_data.png>

// setup
let arrayToAppend = createSampleArray(ofLength: N)
var data = Data()
// start measuring
data.append(contentsOf: arrayToAppend)
// end measuring

<append_array_of_bytes.png>

The new version is still O(N), just with a much smaller constant factor.

var data = Data()
// start measuring
for _ in 0..<N {
    data.append(contentsOf: [0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA])
}
//end measuring

<append_n_arrays.png>

Replacing subranges

// setup
var data = createSampleData(ofLength: N)
// start measuring
data.replaceSubrange(0..<N, with: replacement)
// end measuring

<replace_entire_subrange.png>

// setup
var data = createSampleData(ofLength: N)
// start measuring
data.replaceSubrange(0..<min(N, 5), with: replacement)
// end measuring

<replace_fixed_subrange.png>

Growth of count

// setup
var data = Data()
// start measuring
data.count = N
// end measuring

<grow_small.png>

// setup
var data = Data()
// start measuring
data.count = N
// end measuring

<grow_large.png>

// setup
var data = Data()
data.count = starting
// start measuring
data.count = N
// end measuring

<grow_from_mid_to_large.png>

Bridging to reference types

This should be an O(1) operation. When bridging to a reference, the previous implementation was a bit faster. The only real extra overhead here is the allocation of the NSData object, since a Swift-backed Data has no existing reference type to pass along. A few extra optimizations could be made on this path to close the roughly 150-nanosecond gap. In practice, the cases where Data is bridged back out to Objective-C are usually operations like writing to a file or socket, which dwarf that 150-nanosecond differential.

// setup
let data = createSampleData(ofLength: N)
// start measuring
_ = data as NSData
// end measuring

<bridge_to_objectivec.png>

Bridging from reference types

This should be an O(1) operation.

// setup
let data = createSampleDataReference(ofLength: N)
// start measuring
_ = data as Data
// end measuring

<bridge_from_objectivec.png>
_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

_______________________________________________
swift-server-dev mailing list
swift-server-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-server-dev


(Philippe Hausler) #4

From the beginning of the redesign, part of my thinking was specifically targeted at tuning for server, desktop, and mobile environments.

To touch on a few of the specifics that I was focusing on:

Parsing data should be fast - this means a couple of things should be really quick. Namely, accessing subranges: single bytes and length-based ranges should boil down to a load from an offset or a memmove, respectively. Under the hood NSData uses memory move/copy operations to access bytes in the fastest manner, or allows the developer to access the internal bytes pointer directly. The latter is of course dangerous, and we have much safer mechanisms in Swift to deal with that type of access (via closures, etc.).
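The closure-scoped access mentioned above can be sketched as follows; the pointer is only valid inside the closure, which keeps the unsafe window explicit:

```swift
import Foundation

// Byte-wise access through a scoped pointer rather than an exposed one.
let data = Data(bytes: [0x01, 0x02, 0x03, 0x04])
let total = data.withUnsafeBytes { (ptr: UnsafePointer<UInt8>) -> Int in
    var sum = 0
    for i in 0..<data.count { sum += Int(ptr[i]) }
    return sum
}
// Length-based subranges copy out via subdata(in:):
let middle = data.subdata(in: 1..<3) // bytes 0x02 and 0x03
```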

Furthermore, another very common task is to build up buffers of data by appending small chunks ("buckets in the brigade", in Apache parlance). There is a caveat here: if the end result does not need to be contiguous, DispatchData may be a better type, depending on your usage. There is a reason we have both contiguous data (NSData) and discontiguous data (_NSDispatchData/OS_dispatch_data/DispatchData). Those differing usage patterns aside, I will focus on the contiguous case, since that is what my proposal is about. The reallocation phases for NSData and Data are almost identical algorithms that account for failure patterns and also aim to avoid memory fragmentation (a big problem for servers that may not be running with a compacting malloc). One of the key advantages of Data over [UInt8] is that Array must account for varying strides and potential retains/releases, whereas Data does not. I did not attach those benchmarks since they compare totally different storage mechanisms, but Data is either on par with or faster than [UInt8], since it is specialized for unsafe pointer access and never has to store objects or other strides - it is effectively just a safe wrapper around UnsafeMutableRawPointer.
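The append-small-chunks pattern discussed above, in its simplest form; reallocation is amortized, so the whole build-up stays O(total bytes):

```swift
import Foundation

// Building a contiguous buffer by appending many small framed chunks.
var buffer = Data()
let chunk: [UInt8] = [0x0D, 0x0A, 0x0D, 0x0A] // e.g. a small framing token
for _ in 0..<1_000 {
    buffer.append(contentsOf: chunk)
}
// buffer.count == 4_000
```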

In my iterations on this latest version of Data I used a technique where I set a target assembly execution pattern and tweaked the code until it came as close as possible to that pattern. This development loop (paired with real-world scenarios in my profiling) allowed me to suss out extra retains of the backing store (which cause atomic operations) and unneeded copies that occurred when the CoW semantics got confused by complicated code and abstractions. The new version is not only more maintainable (modulo some repetition caused by avoiding closures unless absolutely needed) but also more grokkable, IMHO (mileage may vary).

The CoW machinery in the new Data implementation only copies when it absolutely needs to. This means that in a server scenario, a payload of Data being handed around never needs its bytes copied, and most of the wrapper structure can be stack allocated, which I think meets the goal of avoiding copies.

Range slicing is now considerably more efficient (I can build some more graphs illustrating that).

And per the last comment about byte-by-byte access: please refer to the subscripting graphs - the new version is about 2x faster on my victim machine (old: ~30 ns/call, new: ~15 ns/call).

Let me know if there are other specific examples of byte access that I should build graphs for to illustrate the performance improvements. Overall I think this is an incremental improvement in performance without needing to change APIs, one which might be applicable to some other key classes to get better performance on iOS/macOS and Linux, on devices as well as servers.

···

On Nov 29, 2016, at 6:12 AM, Helge Heß via swift-server-dev <swift-server-dev@swift.org> wrote:

On 28 Nov 2016, at 20:04, Alex Blewitt via swift-server-dev <swift-server-dev@swift.org> wrote:

This might be of interest to the swift-server-dev community who aren't on either of the other lists.

Thanks.

Hm, I don’t know. Something I don’t quite get is why this is even required. What is really different between that `Data` and a `[ UInt8 ]`? Just the bridging to NSData (which we don’t have on Linux), and couldn’t all that bridging stuff described be done against a special [UInt8] array-backend class too?

The performance patterns attached are focused around a single data object. Also: "Data does not follow this pattern. Often it is sourced from files on disk (which could be exceedingly large)”.
For the server I think the focus needs to be quite different.

So what are the requirements for a ‘byte bucket’ data structure on the server. Some stuff I can think of:

- There are a lot (“millions") of read and write calls with
tons of small byte buffers (essentially socket buffer
sizes, but when framing has been processed even smaller ones ...)

- Copying of those buffers should be avoided as much as possible
- Ideally it should be possible to reuse such buffers (instead
   of malloc/free’ing them at a high rate)

- It is rather rare that the buffers in servers are themselves
modified; usually you create a new, transformed buffer (or
other object, like a String or Int) out of existing ones
- except for one operation: you often slice them (e.g. when
   encountering a HTTP chunk marker, or some other framing in
   a protocol or data format).
- that might imply that the CoW overhead - even if small -
   may be unnecessarily high
- instead there needs to be a focus on dealing with ranges

- It should be highly efficient to create Strings from such
byte buckets (arrays). While creating them should be delayed
to the latest point possible, it quite often will be
the API the top-level framework user is expecting (JSON,
query-strings, XML, whatever).

- There needs to be something to represent arrays of those
buckets, Apache calls them brigades. One could use Arrays
for such, but maybe they should be lists instead.
(this is required to avoid copying split buffers into
  larger ones, and is supported by writev/readv)

- Except for some special cases like caches - unlike mentioned
in the proposal - server apps should rarely ‘source byte
buckets from disks’.
Instead of loading the buckets into memory, maybe there
should be a way to have byte-buckets which are files.
NSData could do mmap’ed files, what you more often really
want in the server context is `sendfile()` though.
- some kind of a ‘FileDescriptorBackedBuffer’ protocol

- Scanning of buffers should be really fast - and by that I
mean C `index()` like fast, not `for byte in bytes {}` array
fast. But that is desirable for [UInt8] too :-)

There is probably more, but I think the stuff above are things that should be addressed :-)

It may also be worth considering whether that ‘buffer’ data structure should only be able to carry bytes, or other payloads as well (along the lines of http://noze.io/noze4nonnode/#streams). But maybe that is too ambitious for v1 :-)

hh

P.S.: This may be good reading material: http://www.apachetutor.org/dev/brigades

Alex

Begin forwarded message:

From: Philippe Hausler via swift-dev <swift-dev@swift.org>
Subject: [swift-dev] Performance Refinement of Data
Date: 19 November 2016 at 19:10:32 GMT
To: swift-dev@swift.org, swift-corelibs-dev@swift.org
Reply-To: Philippe Hausler <phausler@apple.com>



(Alex Blewitt) #5

This might be of interest to the swift-server-dev community who aren't on either of the other lists.

Thanks.

Hm, I don’t know. Something I don’t quite get is why this is even required. What is really different between that `Data` and a `[ UInt8 ]`? Just the bridging to NSData (which we don’t have on Linux), and couldn’t all that bridging stuff described be done against a special [UInt8] array-backend class too?

For one thing, a [UInt8] backed type has a fixed structure in memory (for example, it must be contiguous). A Data type has an opaque storage type which could permit a linked list of blocks (for example). It's also possible to create a Data object which is based on a subset of an original Data without copying. There are also custom deallocator routines which do things differently based on what type the data came from (for example, unmap or free).

Of course the other reason is that Data exists across a number of existing APIs and that's unlikely to change. So making this more efficient will help those use cases.

The performance patterns attached are focused around a single data object. Also: "Data does not follow this pattern. Often it is sourced from files on disk (which could be exceedingly large)”.
For the server I think the focus needs to be quite different.

So what are the requirements for a ‘byte bucket’ data structure on the server. Some stuff I can think of:

- There are a lot (“millions") of read and write calls with
tons of small byte buffers (essentially socket buffer
sizes, but when framing has been processed even smaller ones ...)

Yes, that's true. Dispatch has a mechanism to provide these Data types as well.

- Copying of those buffers should be avoided as much as possible
- Ideally it should be possible to reuse such buffers (instead
   of malloc/free’ing them at a high rate)

I concur. In fact the custom Data deallocator is precisely for this kind of purpose; you can use it to create a magazine of such Data elements, and then return them to the Data pool when they're finished with.

https://developer.apple.com/reference/foundation/data.deallocator
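A small sketch of that pattern with Data.Deallocator.custom: the Data takes over a caller-allocated buffer, and the custom deallocator decides what happens when the Data is done with it (a real magazine would recycle the buffer; this sketch just frees it):

```swift
import Foundation

// Wrap a malloc'ed buffer in a Data with no copy; on deallocation the
// custom closure runs instead of Foundation's default free path.
let size = 4096
let raw = malloc(size)!
let pooled = Data(bytesNoCopy: raw, count: size, deallocator: .custom({ ptr, _ in
    // A buffer pool would return `ptr` to its magazine here.
    free(ptr)
}))
```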

- It is rather rare that the buffers in servers are themselves
modified; usually you create a new, transformed buffer (or
other object, like a String or Int) out of existing ones
- except for one operation: you often slice them (e.g. when
   encountering a HTTP chunk marker, or some other framing in
   a protocol or data format).
- that might imply that the CoW overhead - even if small -
   may be unnecessarily high
- instead there needs to be a focus on dealing with ranges

Data has a method "split", inherited from the Collection protocol, which can be used to return multiple slices of the underlying Data object. This gives an array of slices, which can be passed on to other elements.
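For instance, splitting a framed byte sequence on a delimiter (a small illustration; the slices share the underlying storage rather than copying it):

```swift
import Foundation

// "GET / " as raw bytes, split on the ASCII space (0x20).
let requestLine = Data(bytes: [0x47, 0x45, 0x54, 0x20, 0x2F, 0x20])
let fields = requestLine.split(separator: 0x20)
// fields[0] holds the bytes of "GET", fields[1] the bytes of "/"
```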

- It should be highly efficient to create Strings from such
byte buckets (arrays). While creating them should be delayed
to the latest point possible, it quite often will be
the API the top-level framework user is expecting (JSON,
query-strings, XML, whatever).

A String can be created from a Data with this init:

https://developer.apple.com/reference/swift/string/1418413-init

I would expect that a high-performance JSON or XML parser that takes a Data would probably find it more efficient to iterate through the contents directly rather than going through a String representation first, though.
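Concretely, the init linked above is explicit about the encoding and failable, returning nil if the bytes are not valid in that encoding:

```swift
import Foundation

// Decode a Data's bytes as UTF-8; nil if the bytes are not valid UTF-8.
let bytes = Data(bytes: [0x48, 0x69, 0x21]) // "Hi!"
if let text = String(data: bytes, encoding: .utf8) {
    print(text) // prints "Hi!"
}
```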

- There needs to be something to represent arrays of those
buckets, Apache calls them brigades. One could use Arrays
for such, but maybe they should be lists instead.
(this is required to avoid copying split buffers into
  larger ones, and is supported by writev/readv)

- Except for some special cases like caches - unlike mentioned
in the proposal - server apps should rarely ‘source byte
buckets from disks’.
Instead of loading the buckets into memory, maybe there
should be a way to have byte-buckets which are files.
NSData could do mmap’ed files, what you more often really
want in the server context is `sendfile()` though.
- some kind of a ‘FileDescriptorBackedBuffer’ protocol

- Scanning of buffers should be really fast - and by that I
mean C `index()` like fast, not `for byte in bytes {}` array
fast. But that is desirable for [UInt8] too :-)

There is probably more, but I think the stuff above are things that should be addressed :-)

I think performance is definitely high up on the list of things that are required, but we should also look at the aspects of using the standard buffer type provided by the platform. If we have performance tests/metrics that show a performance delta against a plain [UInt8] or equivalent, it would be good to test those or provide them as test cases for the core Swift team.

Alex

···

On 29 Nov 2016, at 14:12, Helge Heß via swift-server-dev <swift-server-dev@swift.org> wrote:
On 28 Nov 2016, at 20:04, Alex Blewitt via swift-server-dev <swift-server-dev@swift.org> wrote:

It may also be worth considering whether that ‘buffer’ data structure should only be able to carry bytes, or other payloads as well (along the lines of http://noze.io/noze4nonnode/#streams). But maybe that is too ambitious for v1 :slight_smile:

hh

P.S.: This may be good reading material: http://www.apachetutor.org/dev/brigades

Alex

Begin forwarded message:

From: Philippe Hausler via swift-dev <swift-dev@swift.org>
Subject: [swift-dev] Performance Refinement of Data
Date: 19 November 2016 at 19:10:32 GMT
To: swift-dev@swift.org, swift-corelibs-dev@swift.org
Reply-To: Philippe Hausler <phausler@apple.com>

Performance Refinement of Data
Introduction
In Swift 3 the Foundation team introduced a new structural type to represent NSData exposed into Swift.

Data allows developers to interact with binary data with value semantics, which is often more appropriate than using pointers like UnsafeMutablePointer. Having an encapsulating type to abstract the common operations is often quite advantageous to tasks like parsing, where types like String may not be appropriate for or have hidden performance "gotchas".

Data can easily be a critical point of performance for many tasks. Data is an appropriate common currency of transacting a safe managed buffer of bytes that interoperates well with existing Cocoa APIs. This means that it should be tuned for performance as much as possible.

Motivation
There are several outstanding performance improvements which can be made to Data; issues regarding Data's performance have been raised with the Foundation team both publicly and privately, and we would like to address those.

Data should be as fast as possible. Currently, most of the backing of Data is implemented in Foundation; while that implementation is quite fast for reallocations and other operations, it means that calls cross between Swift and Objective-C on every access, even for simple things like count.

This Swift–Objective-C boundary means that no inlining can be performed across it; even when we have full control over the backing implementation and the caller, what would normally be just a single offset load instruction ends up becoming many instructions, just for the benefit of an objc_msgSend (not to mention ARC operations). Even though the calls are heavily optimized, they will never be as fast as a single instruction.

Proposed solution
In order to make Data as fast as possible, the implementation needs to be inlined, since that is one of the best performance optimizations Swift can offer. Doing so requires a rethink of how Data and NSData interact: the structure Data will need to adopt certain attributes that allow it to be inlined into call sites. Not only is function dispatch overhead reduced, but optimization passes like cold-path and branch elimination become possible, letting the compiler do the best possible thing for each call site.

Data will adopt the annotation @inline(__always) in key locations and use a non-Objective-C backing class to store the pointer, length, and other internal ivars. That backing object will allow for a reduction of capture points and will avoid the extra retain/release calls that were added for mutation detection.

Instead of using _MutablePairBoxing (which uses closures to map invocations to references or to apply mutations with copy-on-write semantics), the new backing implementation can apply copy-on-write semantics without any Unmanaged "dancing". That "dancing" complicates the code, and it can be made considerably simpler. Furthermore, avoiding the "dance" can reduce the retain and release calls down to zero when applying mutations in uniquely referenced cases, as well as when mapping non-mutating invocations to the backing storage.
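As a rough illustration of the copy-on-write approach without any Unmanaged "dancing", the standard-library check isKnownUniquelyReferenced can gate mutation directly on the backing class. The names below (_Buffer, COWData) are illustrative, not Foundation's actual internals:

```swift
// A minimal copy-on-write sketch. When the backing reference is unique,
// mutation happens in place with no extra retain/release traffic; when
// it is shared, the storage is copied first, preserving value semantics.
final class _Buffer {
    var storage: [UInt8]
    init(_ storage: [UInt8]) { self.storage = storage }
}

struct COWData {
    private var _buffer: _Buffer

    init(bytes: [UInt8]) { _buffer = _Buffer(bytes) }

    mutating func append(_ byte: UInt8) {
        // Copy only when another Data value shares this storage.
        if !isKnownUniquelyReferenced(&_buffer) {
            _buffer = _Buffer(_buffer.storage)
        }
        _buffer.storage.append(byte)
    }

    var count: Int { return _buffer.storage.count }
}
```

With this shape, `var a = COWData(bytes: [1]); let b = a; a.append(2)` leaves `b` untouched while `a` mutates in place on subsequent appends.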

Subclassing the reference type NSData is still something that Foundation should support when wrapping the reference type in the structure. This effectively means there are five types of backing for Data: Swift-implemented storage, immutable NSData, mutable NSMutableData, custom subclasses of NSData, and custom subclasses of NSMutableData. These cases are delineated to offer the most efficient inline paths possible.

Since Foundation can enforce a no-dynamic-dispatch contract with itself for the standard class-cluster members of NSData and NSMutableData, it can ensure those cases do not need dynamic dispatch every time bytes or length is accessed; the values can be cached until the data is mutated or disposed of. In the cases where a custom subclass is used, of course, all bets are off and every access requires dynamically calling out.

In short, this means that fetching the count of a Data can be optimized down to a single branch and a load from an offset, and the same optimization can be applied to many other methods on Data.
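The single-branch-and-load fast path can be sketched as follows. This is a hypothetical model of the backing described above; the real class and property names in Foundation may differ:

```swift
import Foundation

// Hypothetical backing class: pointer and length are cached ivars, so
// the common case never crosses into Objective-C.
final class _DataStorage {
    var bytes: UnsafeMutableRawPointer?
    var length: Int
    // Models the custom-NSData-subclass case, where every access must
    // still dynamically dispatch to the reference type.
    var requiresDynamicDispatch: Bool
    var reference: NSData?

    init(bytes: UnsafeMutableRawPointer?, length: Int) {
        self.bytes = bytes
        self.length = length
        self.requiresDynamicDispatch = false
        self.reference = nil
    }
}

struct MyData {
    var _backing: _DataStorage

    @inline(__always)
    var count: Int {
        // Fast path: one branch plus a load from an offset.
        if !_backing.requiresDynamicDispatch {
            return _backing.length
        }
        // Slow path: a custom subclass must be asked every time.
        return _backing.reference?.length ?? 0
    }
}
```

Because the getter is annotated @inline(__always) and the backing is not an Objective-C class, the optimizer can collapse `data.count` at the call site instead of emitting an objc_msgSend.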

Bridging to and from Objective-C
Many of the sources that Data is derived from are SDKs written in Objective-C. For other types like Array, Set, Dictionary, or String, the objects returned are often not very large: arrays may have a handful of objects, strings may only be a few hundred characters, and so on. In these small cases it makes sense to "eagerly" bridge those reference types into a more inline-able version (there are exceptions, but in general this is most often the case).

Data does not follow this pattern. It is often sourced from files on disk (which could be exceedingly large) or from the results of network transactions or downloads. These cases would definitely suffer from an "eager" O(n) bridge, due not only to the duplicated memory allocations needed to hold both backing stores but also to the byte copy from the reference type to the value type. Data should be fast no matter where it came from, unless its dynamic-dispatch requirements are truly unknown.

To build a Data that is fast for inline optimizations, the bytes pointer and length need to be cached for the duration of the object. When as is used to bridge a custom reference to Data, dynamic dispatch must occur on every call to count and every time bytes are touched; but if the Data is known to come from a source whose dynamic-dispatch expectations we control, that dynamic dispatch can be elided, and behavior can be preserved by mimicking the Objective-C implementation in Swift.

Bridging in the other direction also has some distinct performance optimizations that can be taken advantage of as well.

When the lifespan of the callout to Objective-C is well known, Swift-constructed Data can easily be passed as an NSData that does not copy the backing buffer. It is the responsibility of the Objective-C APIs either to not retain the data directly past the scope of the call, or to copy it when keeping a long-lasting reference. Any Objective-C method or function that takes an NSData and just retains it, or unsafely stores it past the function callout, is likely incorrect and has bugs no matter the language it was invoked from. In this case, where the Data is created in Swift, the bridge only needs to allocate the wrapper NSData; no O(n) copy needs to occur (unless the callee holds a reference, as previously stated).
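The no-copy direction of this bridge can be sketched with NSData's bytesNoCopy initializer (a real NSData API); the ownership arrangement in this helper is illustrative:

```swift
import Foundation

// Wrap Swift-owned bytes in an NSData without an O(n) copy. With
// freeWhenDone: false the Swift side keeps ownership of the buffer, so
// a callee that wants the bytes past the call's scope must copy them —
// exactly the contract described above.
func bridgeWithoutCopy(_ buffer: UnsafeMutableRawPointer, count: Int) -> NSData {
    return NSData(bytesNoCopy: buffer, length: count, freeWhenDone: false)
}
```

Only the NSData wrapper object itself is allocated here; the byte buffer is shared for the duration of the callout.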

The final case of bridging is when a Data is obtained from Objective-C and then passed directly back to Objective-C. The compiler can apply "peephole" optimizations to direct callout cases such as returnsAData() as NSData, but these are only really enforceable in a limited scope (sometimes just the same line of code). Since the backing store can hold a reference to the reference type, the bridge method (when the data has not been mutated) can pass that reference back over to Objective-C. For mutated versions, a copy of the mutated data can be passed along instead (taking advantage of any optimizations dynamic dispatch affords for calls to copy).

Detailed performance breakdown
Each graph below is a comparison between the Swift 3 Data and the new version of Data for each of the inline cases. The horizontal axis in each graph represents N, and the vertical axis represents the sampled duration in nanoseconds. Each data set in the plots is an average over 100 samples (unless otherwise specified) per value of N. The attached graphs were generated from optimized builds on a Mac Pro (Late 2013) 3.5 GHz 6-Core Intel Xeon E5 with 16 GB 1866 MHz DDR3.

func createSampleData(ofLength N: Int) -> Data {
    var buffer = [UInt8](repeating: 0, count: N)
    return buffer.withUnsafeMutableBytes { (buffer: UnsafeMutableRawBufferPointer) -> Data in
        arc4random_buf(buffer.baseAddress!, N)
        return Data(bytes: buffer.baseAddress!, count: N)
    }
}

func createSampleDataReference(ofLength N: Int) -> NSData {
    var buffer = [UInt8](repeating: 0, count: N)
    return buffer.withUnsafeMutableBytes { (buffer: UnsafeMutableRawBufferPointer) -> NSData in
        arc4random_buf(buffer.baseAddress!, N)
        return NSData(bytes: buffer.baseAddress, length: N)
    }
}

func createSampleArray(ofLength N: Int) -> [UInt8] {
    var buffer = [UInt8](repeating: 0, count: N)
    buffer.withUnsafeMutableBytes { (buffer: UnsafeMutableRawBufferPointer) -> Void in
        arc4random_buf(buffer.baseAddress!, N)
    }
    return buffer
}

Accessing count

This should be an O(1) operation. The y axis is measured in nanoseconds, sampled over 100000 iterations.

// setup
let data = createSampleData(ofLength: N)
// start measuring
_ = data.count
// end measuring

<access_count.png>

Subscripting

This should be an O(1) operation. The y axis is measured in nanoseconds, sampled over 100000 iterations.

// setup
let data = createSampleData(ofLength: N)
// start measuring
_ = data[index]
// end measuring

<getting_subscript.png>

// setup
var data = createSampleData(ofLength: N)
// start measuring
data[index] = 0x00
// end measuring

<setting_subscript.png>

Appending

This should be an O(N) operation.

// setup
let dataToAppend = createSampleData(ofLength: N)
var data = Data()
// start measuring
data.append(dataToAppend)
// end measuring

<append_n_bytes_with_data.png>

// setup
let arrayToAppend = createSampleArray(ofLength: N)
var data = Data()
// start measuring
data.append(contentsOf: arrayToAppend)
// end measuring

<append_array_of_bytes.png>

The new version is still O(N), just with a much smaller constant multiplier.

var data = Data()
// start measuring
for _ in 0..<N {
   data.append(contentsOf: [0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA])
}
//end measuring

<append_n_arrays.png>

Replacing sub ranges

// setup
var data = createSampleData(ofLength: N)
// start measuring
data.replaceSubrange(0..<N, with: replacement)
// end measuring

<replace_entire_subrange.png>

// setup
var data = createSampleData(ofLength: N)
// start measuring
data.replaceSubrange(0..<min(N, 5), with: replacement)
// end measuring

<replace_fixed_subrange.png>

Growth of count

// setup
var data = Data()
// start measuring
data.count = N
// end measuring

<grow_small.png>

// setup
var data = Data()
// start measuring
data.count = N
// end measuring

<grow_large.png>

// setup
var data = Data()
data.count = starting
// start measuring
data.count = N
// end measuring

<grow_from_mid_to_large.png>

Bridging to reference types

This should be an O(1) operation. In the bridge-to-reference case, the previous implementation was a bit faster. The only real extra overhead here is the allocation of the NSData object, since the Swift-backed Data has no existing reference type to pass along. A few extra optimizations in this path could reduce the approximately 150 nanosecond difference. In practice, the cases where Data is bridged back out to Objective-C are usually operations like writing to a file or socket, which dwarf that 150 nanosecond differential.

// setup
let data = createSampleData(ofLength: N)
// start measuring
_ = data as NSData
// end measuring

<bridge_to_objectivec.png>

Bridging from reference types

This should be an O(1) operation.

// setup
let data = createSampleDataReference(ofLength: N)
// start measuring
_ = data as Data
// end measuring

<bridge_from_objectivec.png>
_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

_______________________________________________
swift-server-dev mailing list
swift-server-dev@swift.org <mailto:swift-server-dev@swift.org>
https://lists.swift.org/mailman/listinfo/swift-server-dev



(Helge Heß) #6

Furthermore another very common task is to build up buffers of data (appending small chunks; buckets in the brigade in the parlance of Apache).

I was originally confused by “build up buffers of data”, you mean “build up buffers of buffers of data” :slight_smile:

This is pretty common, yes. Notably this is often “small buffers of large buffers of data”. Like if you have a markdown template like that:

  … tons of html …
  {{ today }}
  … tons of html …

It may be desirable to represent that as a brigade of

  [FileData offset:0 length:nn] // use sendfile()
  [MemoryData] // regular send() here
  [FileData offset:n + 11 length:-1] // sendfile()

E.g. in a CalDAV file-based server that would be pretty common. If you fetch batches of CalDAV events, you would ideally stream them directly off-store, with a lot of static MemoryData for the XML wrapping and very few dynamic Data objects for things like URLs or sync-tokens.

There is a caveat here, of course: if the end result does not need to be contiguous, DispatchData may be the better type in the end, but that depends on your usage.
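The DispatchData route mentioned here can be sketched as follows; the brigade-style append of small chunks is illustrative:

```swift
import Foundation
import Dispatch

// DispatchData concatenates regions without copying them into one
// contiguous allocation, which fits the append-small-chunks
// ("brigade") pattern when the result need not be contiguous.
var brigade = DispatchData.empty
let chunk: [UInt8] = [0x68, 0x69] // "hi"

// Append the same chunk twice; each append records a region rather
// than growing and copying a single buffer.
chunk.withUnsafeBytes { raw in
    brigade.append(DispatchData(bytes: raw))
}
chunk.withUnsafeBytes { raw in
    brigade.append(DispatchData(bytes: raw))
}
// brigade now logically holds 4 bytes across its regions.
```

The regions can later be enumerated individually, which is the shape writev-style scatter/gather I/O wants.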

I think this is the common case on the server today. Loading everything into memory is often the naive/simpler way to get stuff running quickly. But if you look at it, many of the backend servers are just fancy proxies primarily translating between different representations of the data (like PostgreSQL-wire protocol to JSON). More often than not you don’t really need to spool up the data here but can just transform on the fly.

This is often (not always) a little different on the client side. E.g. if you load an image from a server via HTTP, you quite likely need it in full and place it into an on-disk cache anyways. And then you just grab one mmap-data of that resource once it is fully downloaded.

There is a reason why we have both contiguous data (NSData) and discontiguous data (_NSDispatchData/OS_dispatch_data/DispatchData). So, those differing usage patterns aside, I will focus on the contiguous case since that is what my proposal is about.

I’m a little confused about that because it seems to conflict with what Alex has been saying. I think what we need for the server effort is clearly 'discontiguous data’. Is this going to be supported by `Data` (as in NSData)? Something which we can pass to readv and writev (and companions).

hh


On Nov 29, 2016, at 4:57 PM, Philippe Hausler <phausler@apple.com> wrote:


(Helge Heß) #7

Hm, I don’t know. Something I don’t quite get is why this is even required. What is really different between that `Data` and a `[ UInt8 ]`? Just the bridging to NSData (which we don’t have on Linux), and couldn’t all that bridging stuff described be done against a special [UInt8] array-backend class too?

For one thing, a [UInt8] backed type has a fixed structure in memory (for example, it must be contiguous).

I thought that is the difference between Array and ContiguousArray? (Array not requiring a fixed structure in memory).

My thinking was that if an Array<ObjC> can be backed by an NSArray, then Array<UInt8> could be backed by an NSData (or something similar) by the same 'magic'. The thing missing would be access to physical ranges of a non-contiguous Array, but that is desirable for all types, not just bytes.

I don’t care too much, but I think people will find it kinda weird what the difference is between an array of bytes and an array of bytes :slight_smile: Would be kinda beautiful to make them the same.

A Data type has an opaque storage type which could permit a linked list of blocks (for example). It's also possible to create a Data object which is based on a subset of an original Data without copying. There are also custom deallocator routines which do things differently based on what type the data came from (for example, unmap or free).

I’m not entirely sure what you are talking about. About the `Data` type described in the proposal you sent, or `NSData` (mapped to `Data` right now, the one the proposal is targeting to replace), or `DispatchData`?

The proposal doesn’t talk at all about the opaque storage aspects you are mentioning (linked Data etc) hence I was assuming it is just a buffer. It sounds like it is considered implicit that the new `Data` would preserve all the functionality of NSData?

Most of your replies relate to the `NSData backed Data`, I skip those as I was just referring to the proposal. Yes, the ‘old’ Data provides a lot of the facilities required.

BTW: While Data can iterate over sub buffers within a closure, I don’t think it can give you a list of pointers to them for writev() and companions?

Of course the other reason is that Data exists across a number of existing APIs and that's unlikely to change. So making this more efficient will help those use cases.

That one is easy: `typealias Data = [UInt8]` :wink:

I would expect that a high-performance JSON or XML parser that takes a Data would probably find it more efficient to iterate through the contents directly rather than going through a String representation first, though.

I agree, but eventually it’ll (or part of it) get converted to some String. The top-level users will get to see such, presumably.

I think performance is definitely high up on the list of things that are required, but we should also look at the aspects of using the standard buffer type provided by the platform. If we have performance tests/metrics that show a performance delta to using a plain [UInt8] or equivalent then it would be good to test those or provide them as test cases for the core Swift team.

Yes, sorry. I was only looking at the proposal you sent which doesn’t mention any of the facilities you do nor includes performance stats for those (which are more important to the server than the one included). That the new `Data` would have all the facilities of NSData wasn’t obvious to me.

hh


On 29 Nov 2016, at 16:36, Alex Blewitt <alblue@apple.com> wrote: