[PITCH] Pitch v2 for io_uring support in Swift System on Linux

Previously: [PITCH] io_uring support in Swift System on Linux

Updated pitch: swift-system/NNNN-swift-system-io-uring.md at david/ioring · apple/swift-system · GitHub

Changes since the original pitch:

  • Nested the other types in IORing
  • Switched to throwing Errno for errors
  • Naming changes for many APIs
  • Support for setup flags on IORing
  • Feature detection is now an OptionSet
  • Cancellation support
  • Expanded alternatives considered and future directions
  • Lots of miscellaneous cleanup and fixes
  • Registered resources are structs instead of classes

I think this is in reasonably good shape now. The next steps are to try adopting it in Subprocess and to add more supported operations as adopters need them.


IORing, a Swift System API for io_uring

Introduction

io_uring is Linux's solution to asynchronous and batched syscalls, with a particular focus on IO. We propose a low-level Swift API for it in Swift System that could either be used directly by projects with unusual needs, or via intermediaries like Swift NIO, to address scalability and thread pool starvation issues.

Motivation

Up until recently, the overwhelmingly dominant file IO syscalls on major Unix platforms have been synchronous, e.g. read(2). This design is very simple and proved sufficient for many uses for decades, but is less than ideal for Swift's needs in a few major ways:

  1. Requiring an entire OS thread for each concurrent operation imposes significant memory overhead
  2. Requiring a separate syscall for each operation imposes significant CPU/time overhead to switch into and out of kernel mode repeatedly. This has been exacerbated in recent years by mitigations for the Meltdown family of security exploits increasing the cost of syscalls.
  3. Swift's N:M coroutine-on-thread-pool concurrency model assumes that threads will not be blocked. Each thread waiting for a syscall means a CPU core being left idle. In practice systems like NIO that deal in highly concurrent IO have had to work around this by providing their own thread pools.

Non-file IO (network, pipes, etc…) has been in a somewhat better place with epoll and kqueue for asynchronously waiting for readability, but syscall overhead remains a significant issue for highly scalable systems.

With the introduction of io_uring in 2019, Linux now has the kernel-level tools to address these three problems directly. However, io_uring is quite complex and maps poorly into Swift. We expect that by providing a Swift interface to it, we can enable Swift on Linux servers to scale better and be more efficient than it has been in the past.

Proposed solution

We propose a low-level, unopinionated Swift interface for io_uring on Linux (see Future Directions for discussion of possible more abstract interfaces).

struct IORing: ~Copyable provides facilities for

  • Registering and unregistering resources (files and buffers), an io_uring-specific variation on Unix file descriptors that improves their efficiency
  • Registering and unregistering eventfds, which allow asynchronous waiting for completions
  • Enqueueing IO requests
  • Dequeueing IO completions

struct IORing.RegisteredResource<T> represents, via its two typealiases IORing.RegisteredFile and IORing.RegisteredBuffer, registered file descriptors and buffers.

struct IORing.Request: ~Copyable represents an IO operation that can be enqueued for the kernel to execute. It supports a wide variety of operations matching traditional Unix file and socket operations.

Request operations are expressed as overloaded static methods on Request, e.g. openat is spelled

    public static func open(
        _ path: FilePath,
        in directory: FileDescriptor,
        into slot: IORing.RegisteredFile,
        mode: FileDescriptor.AccessMode,
        options: FileDescriptor.OpenOptions = FileDescriptor.OpenOptions(),
        permissions: FilePermissions? = nil,
        context: UInt64 = 0
    ) -> Request

    public static func open(
        _ path: FilePath,
        in directory: FileDescriptor,
        mode: FileDescriptor.AccessMode,
        options: FileDescriptor.OpenOptions = FileDescriptor.OpenOptions(),
        permissions: FilePermissions? = nil,
        context: UInt64 = 0
    ) -> Request

which allows clients to decide whether they want to open the file into a slot on the ring, or have it return a file descriptor via a completion. Similarly, read operations have overloads for "use a buffer from the ring" or "read into this UnsafeMutableRawBufferPointer".

Multiple Requests can be enqueued on a single IORing using the prepare(…) family of methods, and then submitted together using submitPreparedRequests, allowing for things like "open this file, read its contents, and then close it" to be a single syscall. Conveniences are provided for preparing and submitting requests in one call.

Since IO operations can execute in parallel or out of order by default, linked chains of operations can be established with prepare(linkedRequests:…) and related methods. Separate chains can still execute in parallel, and if an operation early in the chain fails, all subsequent operations will deliver cancellation errors as their completion.
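To make the failure behavior of a linked chain concrete, here is a small model in plain Swift. It makes no kernel calls and does not use the proposed API; ModelOperation, runLinkedChain, and the two errno constants are hypothetical names introduced only for this sketch. It assumes the usual io_uring convention of reporting failures as negated errno values, with cancelled links reporting -ECANCELED.

```swift
// Illustrative model only: nothing here talks to the kernel.
let modelENOENT: Int32 = 2       // Linux errno value for ENOENT
let modelECANCELED: Int32 = 125  // Linux errno value for ECANCELED

/// Hypothetical stand-in for one operation in a linked chain.
struct ModelOperation {
  var name: String
  /// Returns a non-negative result on success, or a negated errno on failure.
  var run: () -> Int32
}

/// Runs the chain in order. After the first failure, every later operation
/// is skipped and reports -ECANCELED as its result.
func runLinkedChain(_ chain: [ModelOperation]) -> [(name: String, result: Int32)] {
  var results: [(name: String, result: Int32)] = []
  var chainFailed = false
  for operation in chain {
    if chainFailed {
      results.append((name: operation.name, result: -modelECANCELED))
      continue
    }
    let result = operation.run()
    if result < 0 { chainFailed = true }
    results.append((name: operation.name, result: result))
  }
  return results
}

// The open fails, so the read and close never run and are reported as cancelled.
let chainResults = runLinkedChain([
  ModelOperation(name: "open", run: { -modelENOENT }),
  ModelOperation(name: "read", run: { 4096 }),
  ModelOperation(name: "close", run: { 0 }),
])
```

A client consuming completions for a chain therefore still receives one completion per request, and can tell the original failure apart from the cancellations that follow it.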

Already-completed results can be retrieved from the ring using tryConsumeCompletion, which never waits but may return nil, or blockingConsumeCompletion(timeout:), which synchronously waits (up to an optional timeout) until an operation completes. There's also a bulk version, blockingConsumeCompletions, which may reduce the number of syscalls issued. It takes a closure which will be called repeatedly as completions are available (see Future Directions for potential improvements to this API).
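The timeouts above are expressed as Duration, while kernel interfaces generally want a seconds-plus-nanoseconds pair. The pitch doesn't say how the implementation converts between the two; the following is only a sketch of one way to do it using Duration.components from the standard library. timespecParts is a hypothetical helper name, and on Apple platforms Duration would additionally need an availability annotation.

```swift
/// Splits a Duration into whole seconds and leftover nanoseconds, the shape
/// a timespec-style structure expects. (Hypothetical helper, not part of the pitch.)
func timespecParts(from duration: Duration) -> (seconds: Int64, nanoseconds: Int64) {
  let (seconds, attoseconds) = duration.components
  // One nanosecond is 1_000_000_000 attoseconds.
  return (seconds: seconds, nanoseconds: attoseconds / 1_000_000_000)
}

let parts = timespecParts(from: .milliseconds(1500))
// parts.seconds == 1, parts.nanoseconds == 500_000_000
```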

Since neither polling nor synchronously waiting is optimal in many cases, IORing also exposes the ability to register an eventfd (see eventfd(2)), which will become readable when completions are available on the ring. This can then be monitored asynchronously with epoll, kqueue, or, for clients that link libdispatch, DispatchSource.
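For readers who haven't used eventfd directly, the sketch below shows its counter semantics using only Glibc, independent of the proposed API. drainEventFD is a hypothetical helper name. With default flags, each 8-byte write adds to a kernel-side counter, the descriptor is readable while the counter is nonzero, and one read returns the accumulated value and resets it to zero. On platforms without Glibc the function just returns nil.

```swift
#if canImport(Glibc)
import Glibc
#endif

/// Signals an eventfd `count` times, then returns the counter value that a
/// single read drains. Returns nil on failure or where Glibc is unavailable.
func drainEventFD(afterSignals count: Int) -> UInt64? {
#if canImport(Glibc)
  let fd = eventfd(0, 0)
  guard fd >= 0 else { return nil }
  defer { close(fd) }

  let size = MemoryLayout<UInt64>.size
  var one: UInt64 = 1
  for _ in 0..<count {
    // Each write adds its 8-byte value to the counter.
    guard write(fd, &one, size) == size else { return nil }
  }

  var counter: UInt64 = 0
  // One read returns the accumulated count and resets the counter to zero.
  guard read(fd, &counter, size) == size else { return nil }
  return counter
#else
  return nil
#endif
}

let drained = drainEventFD(afterSignals: 3)
```

One practical consequence: a handler woken by the eventfd generally wants to read it (to reset readability) and then drain every available completion, since a single wakeup can cover more than one.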

struct IORing.Completion: ~Copyable represents the result of an IO operation and provides

  • Flags indicating various operation-specific metadata about the now-completed syscall
  • The context associated with the operation when it was enqueued, as an UnsafeRawPointer or a UInt64
  • The result of the operation, as an Int32 with operation-specific meaning
  • The error, if one occurred

Unfortunately the underlying kernel API makes it relatively difficult to determine which Request led to a given Completion, so it's expected that users will need to create this association themselves via the context parameter.
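One simple way to build that association is a table keyed by the context value. The sketch below is plain Swift and does not touch the proposed API; PendingOperations is a hypothetical type, and the strings stand in for whatever per-operation state a client needs.

```swift
/// Hypothetical bookkeeping for in-flight operations, keyed by the UInt64
/// context value passed when each request is created.
struct PendingOperations<Payload> {
  private var nextContext: UInt64 = 1  // 0 is the default context in the API, so skip it
  private var inFlight: [UInt64: Payload] = [:]

  init() {}

  /// Stores `payload` and returns the context to attach to the request.
  mutating func register(_ payload: Payload) -> UInt64 {
    let context = nextContext
    nextContext &+= 1
    inFlight[context] = payload
    return context
  }

  /// Looks up and removes the payload matching a completion's context.
  mutating func complete(context: UInt64) -> Payload? {
    inFlight.removeValue(forKey: context)
  }
}

var pending = PendingOperations<String>()
let readContext = pending.register("read /etc/hosts")
let closeContext = pending.register("close /etc/hosts")

// Completions may arrive in any order; the context identifies each one.
let firstDone = pending.complete(context: closeContext)
let secondDone = pending.complete(context: readContext)
```

The value returned by register would be passed as the context: argument of a request, and the context read back from the completion would be passed to complete(context:).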

IORing.Features describes the supported features of the underlying kernel io_uring implementation, which can be used to degrade functionality gracefully when running on older systems.
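As a sketch of what a feature check might look like in client code, the example below uses a stand-in OptionSet so that it compiles on its own. ModelFeatures, WaitStrategy, and chooseWaitStrategy are hypothetical; real code would presumably query supportedFeatures on the ring instead. The two raw values are meant to mirror IORING_FEAT_NODROP and IORING_FEAT_EXT_ARG from the kernel header, and the fallback shown (using a separate timeout request when extended arguments are unavailable) is an assumption about how a client could react, not something the pitch specifies.

```swift
/// Stand-in for the proposed IORing.Features, so this sketch is self-contained.
struct ModelFeatures: OptionSet {
  let rawValue: UInt32
  static let nonDroppingCompletions = ModelFeatures(rawValue: 1 << 1) // IORING_FEAT_NODROP
  static let extendedArguments = ModelFeatures(rawValue: 1 << 8)      // IORING_FEAT_EXT_ARG
}

/// Hypothetical choice a client might make based on kernel support.
enum WaitStrategy: Equatable {
  case timeoutPassedWithWait    // the wait call itself carries the timeout
  case separateTimeoutRequest   // fall back to enqueueing a timeout operation
}

func chooseWaitStrategy(for features: ModelFeatures) -> WaitStrategy {
  features.contains(.extendedArguments) ? .timeoutPassedWithWait : .separateTimeoutRequest
}

let olderKernel: ModelFeatures = [.nonDroppingCompletions]
let newerKernel: ModelFeatures = [.nonDroppingCompletions, .extendedArguments]
```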

Detailed design

// IORing is intentionally not Sendable, to avoid internal locking overhead
public struct IORing: ~Copyable {

  public init(queueDepth: UInt32, flags: IORing.SetupFlags = []) throws(Errno)
  
  public struct SetupFlags: OptionSet, RawRepresentable, Hashable {
    public var rawValue: UInt32
    public init(rawValue: UInt32)
    public static var pollCompletions: SetupFlags //IORING_SETUP_IOPOLL
    public static var pollSubmissions: SetupFlags //IORING_SETUP_SQPOLL
    public static var clampMaxEntries: SetupFlags //IORING_SETUP_CLAMP
    public static var startDisabled: SetupFlags //IORING_SETUP_R_DISABLED
    public static var continueSubmittingOnError: SetupFlags //IORING_SETUP_SUBMIT_ALL
    public static var singleSubmissionThread: SetupFlags //IORING_SETUP_SINGLE_ISSUER
    public static var deferRunningTasks: SetupFlags //IORING_SETUP_DEFER_TASKRUN
  }

  public mutating func registerEventFD(_ descriptor: FileDescriptor) throws(Errno)
  public mutating func unregisterEventFD() throws(Errno)
 
  public struct RegisteredResource<T> { }
  public typealias RegisteredFile = RegisteredResource<UInt32>
  public typealias RegisteredBuffer = RegisteredResource<iovec> 
  
  // A `RegisteredResources` is a view into the buffers or files registered with the ring, if any
  public struct RegisteredResources<T>: RandomAccessCollection {
    public subscript(position: Int) -> RegisteredResource<T>
    public subscript(position: UInt16) -> RegisteredResource<T> // This is useful because io_uring likes to use UInt16s as indexes
  }
	
  public mutating func registerFileSlots(count: Int) throws(Errno) -> RegisteredResources<RegisteredFile.Resource>
  public func unregisterFiles()
  public var registeredFileSlots: RegisteredResources<RegisteredFile.Resource>

  public mutating func registerBuffers(
    _ buffers: some Collection<UnsafeMutableRawBufferPointer>
  ) throws(Errno) -> RegisteredResources<RegisteredBuffer.Resource>

  public mutating func registerBuffers(
    _ buffers: UnsafeMutableRawBufferPointer...
  ) throws(Errno) -> RegisteredResources<RegisteredBuffer.Resource>

  public func unregisterBuffers()

  public var registeredBuffers: RegisteredResources<RegisteredBuffer.Resource>

  public func prepare(requests: Request...)
  public func prepare(linkedRequests: Request...)

  public func submitPreparedRequests(timeout: Duration? = nil) throws(Errno)
  public func submit(requests: Request..., timeout: Duration? = nil) throws(Errno)
  public func submit(linkedRequests: Request..., timeout: Duration? = nil) throws(Errno)

  public func submitPreparedRequests() throws(Errno)
  public func submitPreparedRequestsAndWait(timeout: Duration? = nil) throws(Errno)

  public func submitPreparedRequestsAndConsumeCompletions<E>(
    minimumCount: UInt32 = 1,
    timeout: Duration? = nil,
    consumer: (consuming Completion?, Errno?, Bool) throws(E) -> Void
  ) throws(E)

  public func blockingConsumeCompletion(
    timeout: Duration? = nil
  ) throws(Errno) -> Completion

  public func blockingConsumeCompletions<E>(
    minimumCount: UInt32 = 1,
    timeout: Duration? = nil,
    consumer: (consuming Completion?, Errno?, Bool) throws(E) -> Void
  ) throws(E)

  public func tryConsumeCompletion() -> Completion?
	
  public struct Features: OptionSet, RawRepresentable, Hashable {
    public let rawValue: UInt32

    public init(rawValue: UInt32)

    // IORING_FEAT_SINGLE_MMAP is handled internally
    public static let nonDroppingCompletions: Features // IORING_FEAT_NODROP
    public static let stableSubmissions: Features // IORING_FEAT_SUBMIT_STABLE
    public static let currentFilePosition: Features // IORING_FEAT_RW_CUR_POS
    public static let assumingTaskCredentials: Features // IORING_FEAT_CUR_PERSONALITY
    public static let fastPolling: Features // IORING_FEAT_FAST_POLL
    public static let epoll32BitFlags: Features // IORING_FEAT_POLL_32BITS
    public static let pollNonFixedFiles: Features // IORING_FEAT_SQPOLL_NONFIXED
    public static let extendedArguments: Features // IORING_FEAT_EXT_ARG
    public static let nativeWorkers: Features // IORING_FEAT_NATIVE_WORKERS
    public static let resourceTags: Features // IORING_FEAT_RSRC_TAGS
    public static let allowsSkippingSuccessfulCompletions: Features // IORING_FEAT_CQE_SKIP
    public static let improvedLinkedFiles: Features // IORING_FEAT_LINKED_FILE
    public static let registerRegisteredRings: Features // IORING_FEAT_REG_REG_RING
    public static let minimumTimeout: Features // IORING_FEAT_MIN_TIMEOUT
    public static let bundledSendReceive: Features // IORING_FEAT_RECVSEND_BUNDLE
  }
  public var supportedFeatures: Features
}

public extension IORing.RegisteredBuffer {
  var unsafeBuffer: UnsafeMutableRawBufferPointer
}

public extension IORing {
  struct Request: ~Copyable {
    public static func nop(context: UInt64 = 0) -> Request

    // overloads for each combination of registered vs unregistered buffer/descriptor
    // Read
    public static func read(
      _ file: IORing.RegisteredFile,
      into buffer: IORing.RegisteredBuffer,
      at offset: UInt64 = 0,
      context: UInt64 = 0
    ) -> Request

    public static func read(
      _ file: FileDescriptor,
      into buffer: IORing.RegisteredBuffer,
      at offset: UInt64 = 0,
      context: UInt64 = 0
    ) -> Request

    public static func read(
      _ file: IORing.RegisteredFile,
      into buffer: UnsafeMutableRawBufferPointer,
      at offset: UInt64 = 0,
      context: UInt64 = 0
    ) -> Request

    public static func read(
      _ file: FileDescriptor,
      into buffer: UnsafeMutableRawBufferPointer,
      at offset: UInt64 = 0,
      context: UInt64 = 0
    ) -> Request

    // Write
    public static func write(
      _ buffer: IORing.RegisteredBuffer,
      into file: IORing.RegisteredFile,
      at offset: UInt64 = 0,
      context: UInt64 = 0
    ) -> Request

    public static func write(
      _ buffer: IORing.RegisteredBuffer,
      into file: FileDescriptor,
      at offset: UInt64 = 0,
      context: UInt64 = 0
    ) -> Request 

    public static func write(
      _ buffer: UnsafeMutableRawBufferPointer,
      into file: IORing.RegisteredFile,
      at offset: UInt64 = 0,
      context: UInt64 = 0
    ) -> Request

    public static func write(
      _ buffer: UnsafeMutableRawBufferPointer,
      into file: FileDescriptor,
      at offset: UInt64 = 0,
      context: UInt64 = 0
    ) -> Request

    // Close
    public static func close(
      _ file: FileDescriptor,
      context: UInt64 = 0
    ) -> Request 

    public static func close(
      _ file: IORing.RegisteredFile,
      context: UInt64 = 0
    ) -> Request

    // Open At
    public static func open(
      _ path: FilePath,
      in directory: FileDescriptor,
      into slot: IORing.RegisteredFile,
      mode: FileDescriptor.AccessMode,
      options: FileDescriptor.OpenOptions = FileDescriptor.OpenOptions(),
      permissions: FilePermissions? = nil,
      context: UInt64 = 0
    ) -> Request

    public static func open(
      _ path: FilePath,
      in directory: FileDescriptor,
      mode: FileDescriptor.AccessMode,
      options: FileDescriptor.OpenOptions = FileDescriptor.OpenOptions(),
      permissions: FilePermissions? = nil,
      context: UInt64 = 0
    ) -> Request 

    public static func unlink(
      _ path: FilePath,
      in directory: FileDescriptor,
      context: UInt64 = 0
    ) -> Request

    // Cancel

    public enum CancellationMatch {
      case all
      case first
    }

    public static func cancel(
      _ matchAll: CancellationMatch,
      matchingContext: UInt64,
    ) -> Request

    public static func cancel(
      _ matchAll: CancellationMatch,
      matching: FileDescriptor,
    ) -> Request

    public static func cancel(
      _ matchAll: CancellationMatch,
      matching: IORing.RegisteredFile,
    ) -> Request

    // Other operations follow in the same pattern
  }
  
  struct Completion: ~Copyable {
    public struct Flags: OptionSet, Hashable, Codable {
      public let rawValue: UInt32

      public init(rawValue: UInt32)

      public static let moreCompletions: Flags
      public static let socketNotEmpty: Flags
      public static let isNotificationEvent: Flags
    }

    //These are both the same value, but having both eliminates some ugly casts in client code
    public var context: UInt64 
    public var contextPointer: UnsafeRawPointer
    
    public var result: Int32
    
    public var error: Errno? // Convenience wrapper over `result`
    
    public var flags: Flags  
  }
}
	

Usage Examples

Blocking

var ring = try IORing(queueDepth: 2)

// Make space on the ring for our file (this is optional, but improves performance with repeated use)
let file = try ring.registerFileSlots(count: 1)[0]

var statInfo = Glibc.stat() // System doesn't have an abstraction for stat() right now
// Build our requests to open the file and find out how big it is
ring.prepare(linkedRequests:
  .open(path,
    in: parentDirectory,
    into: file,
    mode: mode,
    options: openOptions,
    permissions: nil
  ),
  .stat(file,
    into: &statInfo
  )
)
// Batch submit 2 syscalls in 1!
try ring.submitPreparedRequestsAndConsumeCompletions(minimumCount: 2) { (completion: consuming IORing.Completion?, error, done) in
  if let error {
    throw error // or other error handling as desired
  }
}

// We could register our buffer with the ring too, but we're only using it once
let buffer = UnsafeMutableRawBufferPointer.allocate(byteCount: Int(statInfo.st_size), alignment: 1)

// Build our requests to read the file and close it
ring.prepare(linkedRequests:
  .read(file,
    into: buffer
  ),
  .close(file)
)

// Batch submit 2 syscalls in 1!
try ring.submitPreparedRequestsAndConsumeCompletions(minimumCount: 2) { (completion: consuming IORing.Completion?, error, done) in
  if let error {
    throw error // or other error handling as desired
  }
}

processBuffer(buffer)

Using libdispatch to wait for the read asynchronously

// Initial setup as above up through creating buffer, omitted for brevity

// Make the read request with a context so we can get the buffer out of it in the completion handler
…
.read(file, into: buffer, context: UInt64(UInt(bitPattern: buffer.baseAddress!)))
…

// Make an eventfd and register it with the ring
let eventDescriptor = eventfd(0, 0)
try ring.registerEventFD(FileDescriptor(rawValue: eventDescriptor))

// Make a read source to monitor the eventfd for readability
let readabilityMonitor = DispatchSource.makeReadSource(fileDescriptor: eventDescriptor)
readabilityMonitor.setEventHandler {
  do {
    let completion = try ring.blockingConsumeCompletion()
    if let error = completion.error {
      // handle failure to read the file
    }
    processBuffer(completion.contextPointer)
  } catch {
    // handle failure to wait for the completion
  }
}
readabilityMonitor.activate()

try ring.submitPreparedRequests() // note, not "AndConsumeCompletions" this time
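Since the context is a UInt64 but the interesting payload here is a pointer, it may help to see the round trip spelled out. This is a sketch using only standard library conversions; contextValue(for:) and recoverPointer(fromContext:) are hypothetical helper names, not part of the proposed API.

```swift
/// Packs a pointer into the UInt64 context slot of a request.
func contextValue(for pointer: UnsafeRawPointer) -> UInt64 {
  UInt64(UInt(bitPattern: pointer))
}

/// Recovers the pointer from a completion's UInt64 context.
/// Returns nil if the context is 0, i.e. no pointer was stored.
func recoverPointer(fromContext context: UInt64) -> UnsafeRawPointer? {
  UnsafeRawPointer(bitPattern: UInt(truncatingIfNeeded: context))
}

/// Allocates a scratch buffer and checks that its address survives the round trip.
func roundTripDemo() -> Bool {
  let buffer = UnsafeMutableRawBufferPointer.allocate(byteCount: 16, alignment: 8)
  defer { buffer.deallocate() }
  let original = UnsafeRawPointer(buffer.baseAddress!)
  let context = contextValue(for: original)
  return recoverPointer(fromContext: context) == original
}

let roundTripSucceeded = roundTripDemo()
```

Note that the context carries only the address. The client is still responsible for keeping the buffer alive until the completion arrives, and for remembering its length if it needs to rebuild a buffer pointer.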

Source compatibility

This is an all-new API in Swift System, so it has no backwards compatibility implications. Of note, though, this API is only available on Linux.

ABI compatibility

Swift on Linux does not have a stable ABI, and we will likely take advantage of this to evolve IORing as compiler support improves, as described in Future Directions.

Implications on adoption

This feature is intrinsically linked to Linux kernel support, so constrains the deployment target of anything that adopts it to newer kernels. Exactly which features of the evolving io_uring syscall surface area we need is under consideration.

Future directions

  • While most Swift users on Darwin are not limited by IO scalability issues, the thread pool considerations still make introducing something similar to this appealing if and when the relevant OS support is available. We should attempt to the best of our ability to not design this in a way that's gratuitously incompatible with non-Linux OSs, although Swift System does not attempt to have an API that's identical on all platforms.
  • The set of syscalls covered by io_uring has grown significantly and is still growing. We should leave room for supporting additional operations in the future.
  • Once same-element requirements and pack counts as integer generic arguments are supported by the compiler, we should consider adding something along the lines of the following to allow preparing, submitting, and waiting for an entire set of operations at once:
func submitLinkedRequestsAndWait<each Request>(
  _ requests: repeat each Request
) -> InlineArray<(repeat each Request).count, IORing.Completion>
  where Request == IORing.Request
  • Once mutable borrows are supported, we should consider replacing the closure-taking bulk completion APIs (e.g. blockingConsumeCompletions(…)) with ones that return a sequence of completions instead
  • We should consider making more types noncopyable as compiler support improves
  • liburing has a "peek next completion" operation that doesn't consume it, and then a "mark consumed" operation. We may want to add something similar
  • liburing has support for operations allocating their own buffers and returning them via the completion; we may want to support this
  • We may want to provide API for asynchronously waiting, rather than just exposing the eventfd to let people roll their own async waits. Doing this really well has considerable implications for the concurrency runtime though.
  • We should almost certainly expose API for more of the configuration options in io_uring_setup
  • Stronger safety guarantees around cancellation and resource lifetimes (e.g. as described in Notes on io-uring) would be very welcome, but require an API that is much more strongly opinionated about how io_uring is used. A future higher level abstraction focused on the goal of being "an async IO API for Swift" rather than "a Swifty interface to io_uring" seems like a good place for that.

Alternatives considered

  • We could use a NIO-style separate thread pool, but we believe io_uring is likely a better option for scalability. We may still want to provide a thread-pool backed version as an option, because many Linux systems currently disable io_uring due to security concerns.
  • We could multiplex all IO onto a single actor as AsyncBytes currently does, but this has a number of downsides that make it entirely unsuitable to server usage. Most notably, it eliminates IO parallelism entirely.
  • Using POSIX AIO instead of or as well as io_uring would greatly increase our ability to support older kernels and other Unix systems, but it has well-documented performance and usability issues that have prevented its adoption elsewhere, and apply just as much to Swift.
  • Earlier versions of this proposal had higher level "managed" abstractions over IORing. These have been removed due to lack of interest from clients, but could be added back later if needed.
  • I considered having dedicated error types for IORing, but eventually decided throwing Errno was more consistent with other platform APIs
  • RegisteredResource was originally a class in an attempt to manage the lifetime of the resource via language features. Changing to the current model of it being a copyable struct didn't make the lifetime management any less safe (the IORing still owns the actual resource), and reduces overhead. In the future it would be neat if we could express RegisteredResources as being borrowed from the IORing so they can't be used after its lifetime.

Acknowledgments

The NIO team, in particular Cory Benfield and Franz Busch, have provided invaluable feedback and direction on this project.
