[PITCH] io_uring support in Swift System on Linux

Hi folks!

In keeping with Swift System's policy of being multi-platform rather than cross-platform, I'd appreciate feedback on this low-level abstraction for io_uring, Linux's new-ish batched async syscall API. The "future directions" and "alternatives considered" sections are pretty extensive, so I expect there will be a lot to discuss :smiley:

Draft proposal link for people who prefer markdown: swift-system/NNNN-swift-system-io-uring.md at 7fff872441b6acb03bb8922300c09ef5f2afae2b · apple/swift-system · GitHub

IORing, a Swift System API for io_uring

Introduction

io_uring is Linux's solution to asynchronous and batched syscalls, with a particular focus on IO. We propose a low-level Swift API for it in Swift System that could either be used directly by projects with unusual needs, or via intermediaries like Swift NIO, to address scalability and thread pool starvation issues.

Motivation

Up until recently, the overwhelmingly dominant file IO syscalls on major Unix platforms have been synchronous, e.g. read(2). This design is very simple and proved sufficient for many uses for decades, but is less than ideal for Swift's needs in a few major ways:

  1. Requiring an entire OS thread for each concurrent operation imposes significant memory overhead
  2. Requiring a separate syscall for each operation imposes significant CPU/time overhead to switch into and out of kernel mode repeatedly. This has been exacerbated in recent years by mitigations for the Spectre family of security exploits increasing the cost of syscalls.
  3. Swift's N:M coroutine-on-thread-pool concurrency model assumes that threads will not be blocked. Each thread waiting for a syscall means a CPU core being left idle. In practice systems like NIO that deal in highly concurrent IO have had to work around this by providing their own thread pools.

Non-file IO (network, pipes, etc…) has been in a somewhat better place with epoll and kqueue for asynchronously waiting for readability, but syscall overhead remains a significant issue for highly scalable systems.

With the introduction of io_uring in 2019, Linux now has the kernel level tools to address these three problems directly. However, io_uring is quite complex and maps poorly into Swift. We expect that by providing a Swift interface to it, we can enable Swift on Linux servers to scale better and be more efficient than it has been in the past.

Proposed solution

struct IORing: ~Copyable provides facilities for

  • Registering and unregistering resources (files and buffers), an io_uring-specific mechanism that reduces per-operation overhead compared to passing plain file descriptors and buffers
  • Registering and unregistering eventfds, which allow asynchronous waiting for completions
  • Enqueueing IO requests
  • Dequeueing IO completions

class IOResource<T> represents, via its two typealiases IORingFileSlot and IORingBuffer, registered file descriptors and buffers. Ideally we'd express the lifetimes of these as being dependent on the lifetime of the ring, but so far that's proven intractable, so we use a reference type. We expect that the up-front overhead of this should be negligible for larger operations, and smaller or one-shot operations can use non-registered buffers and file descriptors.
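As a rough sketch (the names and sizes here are illustrative, not part of the proposal), setting up registered resources might look like:

// Hypothetical setup: register four file slots and one 64 KiB buffer up front.
var ring = try IORing(queueDepth: 32)
let fileSlots = try ring.registerFileSlots(count: 4)
let storage = UnsafeMutableRawBufferPointer.allocate(byteCount: 64 * 1024, alignment: 16)
let registeredBuffer = try ring.registerBuffers(storage)[0] // an IORingBuffer

The resulting IORingFileSlot and IORingBuffer values can then be passed to the request constructors described below in place of plain file descriptors and pointers.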

struct IORequest: ~Copyable represents an IO operation that can be enqueued for the kernel to execute. It supports a wide variety of operations matching traditional Unix file and socket syscalls.

IORequest operations are expressed as overloaded static methods on IORequest, e.g. openat is spelled

    public static func opening(
        _ path: FilePath,
        in directory: FileDescriptor,
        into slot: IORingFileSlot,
        mode: FileDescriptor.AccessMode,
        options: FileDescriptor.OpenOptions = FileDescriptor.OpenOptions(),
        permissions: FilePermissions? = nil,
        context: UInt64 = 0
    ) -> IORequest

    public static func opening(
        _ path: FilePath,
        in directory: FileDescriptor,
        mode: FileDescriptor.AccessMode,
        options: FileDescriptor.OpenOptions = FileDescriptor.OpenOptions(),
        permissions: FilePermissions? = nil,
        context: UInt64 = 0
    ) -> IORequest

which allows clients to decide whether they want to open the file into a slot on the ring, or have it return a file descriptor via a completion. Similarly, read operations have overloads for "use a buffer from the ring" or "read into this UnsafeMutableRawBufferPointer".

Multiple IORequests can be enqueued on a single IORing using the prepare(…) family of methods, and then submitted together using submitPreparedRequests, allowing for things like "open this file, read its contents, and then close it" to be a single syscall. Conveniences are provided for preparing and submitting requests in one call.

Since IO operations can execute in parallel or out of order by default, linked chains of operations can be established with prepare(linkedRequests:…) and related methods. Separate chains can still execute in parallel, and if an operation early in the chain fails, all subsequent operations will deliver cancellation errors as their completion.

Already-completed results can be retrieved from the ring using tryConsumeCompletion, which never waits but may return nil, or blockingConsumeCompletion(timeout:), which synchronously waits (up to an optional timeout) until an operation completes. There's also a bulk version of blockingConsumeCompletion, which may reduce the number of syscalls issued. It takes a closure which will be called repeatedly as completions are available (see Future Directions for potential improvements to this API).
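As a sketch of what consuming completions might look like in practice (handle(_:) is a hypothetical application callback, not part of the proposed API):

// Poll: drain whatever has already completed, without blocking.
while let completion = ring.tryConsumeCompletion() {
    handle(completion)
}

// Or block until at least 8 completions are available (or one second passes),
// and let the ring hand them over in bulk.
try ring.blockingConsumeCompletions(minimumCount: 8, timeout: .seconds(1)) { completion, error, isDone in
    if let error { throw error }          // ring-level failure
    guard let completion else { return }  // nothing more to deliver
    handle(completion)
}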

Since neither polling nor synchronously waiting is optimal in many cases, IORing also exposes the ability to register an eventfd (see eventfd(2)), which will become readable when completions are available on the ring. This can then be monitored asynchronously with epoll, kqueue, or, for clients linking libdispatch, DispatchSource.

struct IOCompletion: ~Copyable represents the result of an IO operation and provides

  • Flags indicating various operation-specific metadata about the now-completed syscall
  • The context associated with the operation when it was enqueued, as an UnsafeRawPointer or a UInt64
  • The result of the operation, as an Int32 with operation-specific meaning
  • The error, if one occurred

Unfortunately the underlying kernel API makes it relatively difficult to determine which IORequest led to a given IOCompletion, so it's expected that users will need to create this association themselves via the context parameter.
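For example, a client might do its own bookkeeping along these lines (a hypothetical pattern; PendingOperation, ring, file, and buffer are stand-ins):

// Hypothetical bookkeeping: map context values back to per-operation state.
struct PendingOperation { var buffer: UnsafeMutableRawBufferPointer }
var inFlight: [UInt64: PendingOperation] = [:]

let id: UInt64 = 42
inFlight[id] = PendingOperation(buffer: buffer)
ring.prepare(requests: .reading(file, into: buffer, context: id))

// When the matching completion arrives, recover the state:
// let pending = inFlight.removeValue(forKey: completion.context)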

IORingError represents failure of an operation.

IORing.Features describes the supported features of the underlying kernel IORing implementation, which can be used to provide graceful reduction in functionality when running on older systems.
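For illustration, a client that depends on particular kernel features might check them up front (a sketch; the fallback behavior is up to the client):

// Hypothetical feature check before relying on newer io_uring behavior.
let features = IORing.supportedFeatures
if !features.stableSubmissions || !features.nonDroppingCompletions {
    // Fall back to a simpler I/O path, or surface IORingError.missingRequiredFeatures.
}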

Detailed design

public class IOResource<T> {
    // The underlying kernel representation: a slot index (UInt32) for files, an iovec for buffers
    public typealias Resource = T
}
public typealias IORingFileSlot = IOResource<UInt32>
public typealias IORingBuffer = IOResource<iovec>

extension IORingBuffer {
    public var unsafeBuffer: UnsafeMutableRawBufferPointer
}

// IORing is intentionally not Sendable, to avoid internal locking overhead
public struct IORing: ~Copyable {

	public init(queueDepth: UInt32) throws(IORingError)
	
	public mutating func registerEventFD(_ descriptor: FileDescriptor) throws(IORingError)
	public mutating func unregisterEventFD(_ descriptor: FileDescriptor) throws(IORingError)
	
	// An IORing.RegisteredResources is a view into the buffers or files registered with the ring, if any
	public struct RegisteredResources<T>: RandomAccessCollection {
		public subscript(position: Int) -> IOResource<T>
		public subscript(position: UInt16) -> IOResource<T> // This is useful because io_uring likes to use UInt16s as indexes
	}
	
	public mutating func registerFileSlots(count: Int) throws(IORingError) -> RegisteredResources<IORingFileSlot.Resource>
	
	public func unregisterFiles()
	
	public var registeredFileSlots: RegisteredResources<IORingFileSlot.Resource>
	
	public mutating func registerBuffers(
		_ buffers: some Collection<UnsafeMutableRawBufferPointer>
	) throws(IORingError) -> RegisteredResources<IORingBuffer.Resource>
	
	public mutating func registerBuffers(
		_ buffers: UnsafeMutableRawBufferPointer...
	) throws(IORingError) -> RegisteredResources<IORingBuffer.Resource>
	
	public func unregisterBuffers()
	
	public var registeredBuffers: RegisteredResources<IORingBuffer.Resource>
	
	public func prepare(requests: IORequest...)
	public func prepare(linkedRequests: IORequest...)
	
	public func submitPreparedRequests(timeout: Duration? = nil) throws(IORingError)
	public func submit(requests: IORequest..., timeout: Duration? = nil) throws(IORingError)
	public func submit(linkedRequests: IORequest..., timeout: Duration? = nil) throws(IORingError)
	
	public func submitPreparedRequestsAndWait(timeout: Duration? = nil) throws(IORingError)
	
	public func submitPreparedRequestsAndConsumeCompletions<E: Error>(
		minimumCount: UInt32 = 1,
		timeout: Duration? = nil,
		consumer: (consuming IOCompletion?, IORingError?, Bool) throws(E) -> Void
	) throws(E)
	
	public func blockingConsumeCompletion(
		timeout: Duration? = nil
	) throws(IORingError) -> IOCompletion

	public func blockingConsumeCompletions<E: Error>(
		minimumCount: UInt32 = 1,
		timeout: Duration? = nil,
		consumer: (consuming IOCompletion?, IORingError?, Bool) throws(E) -> Void
	) throws(E)
    
	public func tryConsumeCompletion() -> IOCompletion?
	
	public struct Features {
		//IORING_FEAT_SINGLE_MMAP is handled internally
		public var nonDroppingCompletions: Bool //IORING_FEAT_NODROP
		public var stableSubmissions: Bool //IORING_FEAT_SUBMIT_STABLE
		public var currentFilePosition: Bool //IORING_FEAT_RW_CUR_POS
		public var assumingTaskCredentials: Bool //IORING_FEAT_CUR_PERSONALITY
		public var fastPolling: Bool //IORING_FEAT_FAST_POLL
		public var epoll32BitFlags: Bool //IORING_FEAT_POLL_32BITS
		public var pollNonFixedFiles: Bool //IORING_FEAT_SQPOLL_NONFIXED
		public var extendedArguments: Bool //IORING_FEAT_EXT_ARG
		public var nativeWorkers: Bool //IORING_FEAT_NATIVE_WORKERS
		public var resourceTags: Bool //IORING_FEAT_RSRC_TAGS
		public var allowsSkippingSuccessfulCompletions: Bool //IORING_FEAT_CQE_SKIP
		public var improvedLinkedFiles: Bool //IORING_FEAT_LINKED_FILE
		public var registerRegisteredRings: Bool //IORING_FEAT_REG_REG_RING
		public var minimumTimeout: Bool //IORING_FEAT_MIN_TIMEOUT
		public var bundledSendReceive: Bool //IORING_FEAT_RECVSEND_BUNDLE
	}
	public static var supportedFeatures: Features
}

public struct IORequest: ~Copyable {
    public static func nop(context: UInt64 = 0) -> IORequest
	
	// overloads for each combination of registered vs unregistered buffer/descriptor
	// Read
    public static func reading(
        _ file: IORingFileSlot,
        into buffer: IORingBuffer,
        at offset: UInt64 = 0,
        context: UInt64 = 0
    ) -> IORequest
	
    public static func reading(
        _ file: FileDescriptor,
        into buffer: IORingBuffer,
        at offset: UInt64 = 0,
        context: UInt64 = 0
    ) -> IORequest
    
    public static func reading(
        _ file: IORingFileSlot,
        into buffer: UnsafeMutableRawBufferPointer,
        at offset: UInt64 = 0,
        context: UInt64 = 0
    ) -> IORequest
    
    public static func reading(
        _ file: FileDescriptor,
        into buffer: UnsafeMutableRawBufferPointer,
        at offset: UInt64 = 0,
        context: UInt64 = 0
    ) -> IORequest
    
    // Write
    public static func writing(
        _ buffer: IORingBuffer,
        into file: IORingFileSlot,
        at offset: UInt64 = 0,
        context: UInt64 = 0
    ) -> IORequest
    
    public static func writing(
        _ buffer: IORingBuffer,
        into file: FileDescriptor,
        at offset: UInt64 = 0,
        context: UInt64 = 0
    ) -> IORequest 
    
    public static func writing(
        _ buffer: UnsafeMutableRawBufferPointer,
        into file: IORingFileSlot,
        at offset: UInt64 = 0,
        context: UInt64 = 0
    ) -> IORequest
    
    public static func writing(
        _ buffer: UnsafeMutableRawBufferPointer,
        into file: FileDescriptor,
        at offset: UInt64 = 0,
        context: UInt64 = 0
    ) -> IORequest
    
    // Close
    public static func closing(
        _ file: FileDescriptor,
        context: UInt64 = 0
    ) -> IORequest 
    
    public static func closing(
        _ file: IORingFileSlot,
        context: UInt64 = 0
    ) -> IORequest
    
    // Open At
    public static func opening(
        _ path: FilePath,
        in directory: FileDescriptor,
        into slot: IORingFileSlot,
        mode: FileDescriptor.AccessMode,
        options: FileDescriptor.OpenOptions = FileDescriptor.OpenOptions(),
        permissions: FilePermissions? = nil,
        context: UInt64 = 0
    ) -> IORequest
    
    public static func opening(
        _ path: FilePath,
        in directory: FileDescriptor,
        mode: FileDescriptor.AccessMode,
        options: FileDescriptor.OpenOptions = FileDescriptor.OpenOptions(),
        permissions: FilePermissions? = nil,
        context: UInt64 = 0
    ) -> IORequest 
    
    public static func unlinking(
        _ path: FilePath,
        in directory: FileDescriptor,
        context: UInt64 = 0
    ) -> IORequest
    
    // Other operations follow in the same pattern
}

public struct IOCompletion: ~Copyable {

	public struct Flags: OptionSet, Hashable, Codable {
		public let rawValue: UInt32

		public init(rawValue: UInt32)

		public static let moreCompletions: Flags
		public static let socketNotEmpty: Flags
		public static let isNotificationEvent: Flags
	}

	// These are both the same value, but having both eliminates some ugly casts in client code
	public var context: UInt64
	public var contextPointer: UnsafeRawPointer

	public var result: Int32

	public var error: IORingError? // Convenience wrapper over `result`

	public var flags: Flags
}

public struct IORingError: Error, Equatable {
    public static var missingRequiredFeatures: IORingError
    public static var operationCanceled: IORingError
    public static var timedOut: IORingError
    public static var resourceRegistrationFailed: IORingError
    // Other error values to be filled out as the set of supported operations expands in the future
    public static func unknown(errorCode: Int) -> IORingError
}
	

Usage Examples

Blocking

var ring = try IORing(queueDepth: 2)

//Make space on the ring for our file (this is optional, but improves performance with repeated use)
let file = try ring.registerFileSlots(count: 1)[0]

var statInfo = Glibc.stat() // System doesn't have an abstraction for stat() right now
// Build our requests to open the file and find out how big it is
ring.prepare(linkedRequests:
	.opening(path,
		in: parentDirectory,
		into: file,
		mode: mode,
   		options: openOptions,
		permissions: nil
	),
	.readingMetadataOf(file, 
		into: &statInfo
	)
)
//batch submit 2 syscalls in 1!
try ring.submitPreparedRequestsAndConsumeCompletions(minimumCount: 2) { (completion: consuming IOCompletion?, error, done) in
	if let error {
		throw error //or other error handling as desired
	}
} 

// We could register our buffer with the ring too, but we're only using it once
let buffer = UnsafeMutableRawBufferPointer.allocate(byteCount: Int(statInfo.st_size), alignment: 1)

// Build our requests to read the file and close it
ring.prepare(linkedRequests:
	 .reading(file,
	 	into: buffer
	 ),
	 .closing(file)
)

//batch submit 2 syscalls in 1!
try ring.submitPreparedRequestsAndConsumeCompletions(minimumCount: 2) { (completion: consuming IOCompletion?, error, done) in
	if let error {
		throw error //or other error handling as desired
	}
}

processBuffer(buffer)

Using libdispatch to wait for the read asynchronously

//Initial setup as above up through creating buffer, omitted for brevity

//Make the read request with a context so we can get the buffer out of it in the completion handler
…
.reading(file, into: buffer, context: UInt64(UInt(bitPattern: buffer.baseAddress!)))
…

// Make an eventfd and register it with the ring
let eventFD = FileDescriptor(rawValue: eventfd(0, 0))
try ring.registerEventFD(eventFD)

// Make a read source to monitor the eventfd for readability
let readabilityMonitor = DispatchSource.makeReadSource(fileDescriptor: eventFD.rawValue)
readabilityMonitor.setEventHandler {
	let completion = try! ring.blockingConsumeCompletion()
	if let error = completion.error {
		//handle failure to read the file
	}
	processBuffer(completion.contextPointer)
}
readabilityMonitor.activate()

try ring.submitPreparedRequests() //note, not "AndConsumeCompletions" this time

Source compatibility

This is an all-new API in Swift System, so it has no backwards-compatibility implications. Note, though, that this API is only available on Linux.

ABI compatibility

Swift on Linux does not have a stable ABI, and we will likely take advantage of this to evolve IORing as compiler support improves, as described in Future Directions.

Implications on adoption

This feature is intrinsically linked to Linux kernel support, so constrains the deployment target of anything that adopts it to newer kernels. Exactly which features of the evolving io_uring syscall surface area we need is under consideration.

Future directions

  • While most Swift users on Darwin are not limited by IO scalability issues, the thread pool considerations still make introducing something similar to this appealing if and when the relevant OS support is available. We should attempt to the best of our ability to not design this in a way that's gratuitously incompatible with non-Linux OSs, although Swift System does not attempt to have an API that's identical on all platforms.
  • The set of syscalls covered by io_uring has grown significantly and is still growing. We should leave room for supporting additional operations in the future.
  • Once same-element requirements and pack counts as integer generic arguments are supported by the compiler, we should consider adding something along the lines of the following to allow preparing, submitting, and waiting for an entire set of operations at once:
func submitLinkedRequestsAndWait<each Request>(
  _ requests: repeat each Request
) -> InlineArray<(repeat each Request).count, IOCompletion>
  where Request == IORequest
  • Once mutable borrows are supported, we should consider replacing the closure-taking bulk completion APIs (e.g. blockingConsumeCompletions(…)) with ones that return a sequence of completions instead
  • We should consider making more types noncopyable as compiler support improves
  • liburing has a "peek next completion" operation that doesn't consume it, and then a "mark consumed" operation. We may want to add something similar
  • liburing has support for operations allocating their own buffers and returning them via the completion; we may want to support this
  • We may want to provide API for asynchronously waiting, rather than just exposing the eventfd to let people roll their own async waits. Doing this really well has considerable implications for the concurrency runtime though.
  • We should almost certainly expose API for more of the configuration options in io_uring_setup
  • The API for feature probing is functional but not especially nice. Finding a better way to present that concept would be desirable.

Alternatives considered

  • We could use a NIO-style separate thread pool, but we believe io_uring is likely a better option for scalability. We may still want to provide a thread-pool backed version as an option, because many Linux systems currently disable io_uring due to security concerns.
  • We could multiplex all IO onto a single actor as AsyncBytes currently does, but this has a number of downsides that make it entirely unsuitable to server usage. Most notably, it eliminates IO parallelism entirely.
  • Using POSIX AIO instead of or as well as io_uring would greatly increase our ability to support older kernels and other Unix systems, but it has well-documented performance and usability issues that have prevented its adoption elsewhere, and apply just as much to Swift.
  • Earlier versions of this proposal had higher level "managed" abstractions over IORing. These have been removed due to lack of interest from clients, but could be added back later if needed.
  • I considered making any or all of IORingError, IOCompletion, and IORequest nested struct declarations inside IORing. The main reason I haven't done so is I was a little concerned about the ambiguity of having a type called Error. I'd be particularly interested in feedback on this choice.
  • I personally find the gerund naming ("…ing") of IORequests pleasant to read, but it may make sense to use a different naming scheme

Acknowledgments

The NIO team, in particular @lukasa and @FranzBusch, have provided invaluable feedback and direction on this project.

21 Likes

So that is intended to be public and NOT open because you have subclasses that are opaque right?

what clock does that use? is it really a timeout? or should it optimally be a deadline of sorts?

That really feels like an OptionSet and not a wild-pack of bools.

More generally, per the design: there are a number of context parameters; would it perhaps be interesting to leverage those and stuff a continuation inside? They're pointer sized, which would let those functions have async/await counterparts.

I'm really glad to see some high performance IO stuff landing, but these feel rather "raw"; is there appetite to perhaps push for some more "high level but still high performance" versions?

1 Like

It should probably actually be final. Ideally it would be a noncopyable struct, but I ran into language limitations that prevented that.

This is the phrasing the underlying kernel API uses, we're not running our own timer at all. It might make sense to try to do a little concept translation in our layer though? I'll look into that.

As discussed in "Alternatives Considered", we initially had a higher level API built on this one, but it became clear that all our known interested parties (e.g. NIO) would use the low level one anyway, so we chose to refocus on it.

I would definitely like to see a higher level API that integrates with the Concurrency runtime, but doing that "Right" is interestingly complex. Whether it makes sense to do an "in between" API that's Swift Concurrency shaped but not actually integrating rings into the runtime is unclear to me (note: not saying it doesn't make sense, just that there's unanswered questions). What does "wake up a coroutine when epoll fires on this eventfd" look like internally for example?

4 Likes

Do you think these are fundamental limitations, or missing functionality we could add later?

2 Likes

I should make sure there's a bug report filed for everything I've run into. I know I filed "I want to be able to return an InlineArray with a count equal to the number of variadic arguments", and I think there were internal reports for most of the others, but I should verify they're all tracked.

(As far as I know none of it is a fundamental limitation)

1 Like

Very nice to see this, it'll be nice to have a Swift wrapper around the io_uring APIs!

My immediate reaction was on the opcode naming:

The "ing" are really weird. Why invent "-ing" isms for all the opcodes? Not all op codes will lean themselves to this very well, so it'll feel inconsistent. Like, would it be "tee-ing" and or what about IORING_OP_STATX or IORING_OP_SHUTDOWN ".shuttingDown"? Somehow making the opcodes "ing" when for some it'll work but for others less so seems weird to me. Since it's request opcodes the imperative "read" makes more sense as well, I'm not sure the "ing" add much clarity or context.

2 Likes

You're the second person today to react this way, so I think there may be something to it :slight_smile: I'll take a look at renaming

5 Likes

It should (at least theoretically) be possible to store a noncopyable value as a property of a wrapper class as a way to provide a copyable interface to borrowing the value. If possible, maybe you could factor IOResource into a boxing class and a noncopyable type, so that the latter is still available for the use cases where it does work today, and is ready when language limitations are lifted?
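Something like this rough shape, perhaps (a sketch with made-up names, just to illustrate the factoring):

// Hypothetical: a noncopyable resource plus a copyable class box around it.
struct RegisteredResource<T>: ~Copyable {
    let index: UInt16
    let value: T
}

final class ResourceBox<T> {
    private var resource: RegisteredResource<T>
    init(_ resource: consuming RegisteredResource<T>) { self.resource = resource }

    // Copyable handle; clients borrow the underlying noncopyable value as needed.
    func withResource<R>(_ body: (borrowing RegisteredResource<T>) throws -> R) rethrows -> R {
        try body(resource)
    }
}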

4 Likes

Seconded. I think it’s worth pointing to the work withoutboats has done here with ringbahn wrapping iouring in a safe interface. More important - at least to me - than async/await, ringbahn handles cancellation to prevent resource leaks, and moves ownership of buffers into the Ops themselves to prevent a class of errors where users prematurely free Op resources logically owned by the kernel post-submission.

I sketched a naive kevent-flavored thing that has callers register their interest in a central registry, and which becomes a pollable async sequence (in this naive example, for a long-running Task), written well before Swift had ownership. That works with epoll, where you essentially just want to keep a mapping from fds to continuations (in the gist, implicitly via dispatching on the tags registered for each interested listener). For io_uring I think something similar-ish might work, albeit with more management of the state of the completion and fewer tags.

I think the most awkward thing about async/await here is how it papers over the submission-based API and makes batching ops difficult. Like, it’d be awfully cool if you could write a set of async lets per op and have a global IO executor try to tie the whole shebang together in a more intelligent way than one submission per enqueued task, but then that’s more “push iouring into the runtime” as you were saying.

1 Like

Very happy to see work on io_uring - it is a perfect fit for SwiftNIO and Swift Concurrency.

I read through the PR and pitch; it would be great to have a more complete example of how the use site looks when working with registered buffers, if you don't mind? The ability to have multiple reads in-flight without having to poll is just super, but it's a bit hard to tell from the API how that would look, especially registering buffer groups and selecting them (if supported).

MIA is IORING_SETUP_SQPOLL support with related features - as well as support for the IORING_SETUP_ATTACH_WQ flag in combination - it's nice to be able to have multiple rings which share a common kernel polling thread. It's great to be able to avoid context switching into the kernel for syscalls on servers with high throughput requirements, so I hope to see that as a future direction.

We did some work on uring support earlier for SwiftNIO so commented on your PR where I saw a couple of potential issues that we ran into then.

(+1 for removing ing, but saw it was already addressed)

1 Like

Overall very excited for this! We have been wanting io_uring support in NIO for a long time. Before diving into the detailed bits I want to quickly reply to a comment @Philippe_Hausler brought up.

While I agree that we really want to have async/await native file I/O and, for that matter, all I/O, I was advocating strongly for this to not be part of the lower level abstractions. As @David_Smith alluded to, integrating I/O with Concurrency is a larger undertaking; while we could come up with a bespoke solution for io_uring, it would require us to own an eventfd and a kqueue/epoll. I personally think the right approach here needs to be one level higher, inside Swift Concurrency itself. It has been a long-standing issue that our Concurrency runtime doesn't integrate with I/O at all. I would like us to explore whether we can expand the various Executor APIs with I/O related methods. This way the executors themselves can leverage their already existing kqueue/epoll/etc., and I/O can be safely done from any context. Under the hood those executors can then integrate with the proposed io_uring APIs here.

I have a few minor comments on the proposal itself.

class IOResource<T> { }

I agree with @Joe_Groff here that it would be great if we can explore making this ~Copyable and back it by a class so we can get rid of the class once the required language features are added. In general, being able to do io_uring allocation free (except the buffers) would be great.

Buffer usage

extension IORingBuffer {
    public var unsafeBuffer: UnsafeMutableRawBufferPointer
}

public mutating func registerBuffers(
  _ buffers: some Collection<UnsafeMutableRawBufferPointer>
) throws(IORingError) -> RegisteredResources<IORingBuffer.Resource>

This seems a bit dangerous; do we want to adopt a with-style spelling or use RawSpan instead here?

Usage of FileDescriptor

FileDescriptor is the canonical type in swift-system, however with this proposal its usage becomes a bit interesting. Currently, the type supports various methods such as reading/writing/closing directly. These methods use the read/write/close syscalls directly. When adding io_uring to swift-system we are effectively adding another way to do all of these operations. I'm wondering what our overall strategy is going to be for the existing methods on FileDescriptor. Maybe it would be enough to mark most of them as noasync.
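For example (illustrative only; the declaration is paraphrased rather than the exact swift-system signature), the annotation could look roughly like:

// Hypothetical: how an existing blocking convenience might be marked.
@available(*, noasync, message: "Blocks the calling thread; prefer submitting an IORequest to an IORing from async contexts")
public func read(into buffer: UnsafeMutableRawBufferPointer, retryOnInterrupt: Bool = true) throws -> Int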

2 Likes

Unfortunately you can't have some Collection<[Mutable]RawSpan>, because Collection requires that its element be Copyable and Escapable. @David_Smith what aspect of Collection do we need here vis-a-vis Sequence? Sequence doesn't support ~Copyable elements either, but at least a rough design of what that might look like is closer to being within reach (we still wouldn't be able to use it today, but I'm curious how much of Collection we really need here, because this is a good use case to consider when designing containers that support these non-constraints).

1 Like

The canonical way of doing this is to represent the outstanding I/O on the "user side" with a heap allocation. Thus:

What if this was an UnsafePointer<T> and your request/response types were generic over this T? Or you could even require that T: AnyObject, and then use the address of the T as the context; the user of the library will then use class instances to represent in-flight I/Os. Something like this would make the raw API slightly more convenient to use without the ceremony of casting these values.
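Concretely, that might look something like this (a sketch using Unmanaged to round-trip the reference; ring, file, and buffer are assumed to exist):

// Hypothetical: stash a retained object reference in the 64-bit context,
// then recover it when the completion arrives.
final class InFlightRead {
    let buffer: UnsafeMutableRawBufferPointer
    init(buffer: UnsafeMutableRawBufferPointer) { self.buffer = buffer }
}

let op = InFlightRead(buffer: buffer)
let context = UInt64(UInt(bitPattern: Unmanaged.passRetained(op).toOpaque()))
ring.prepare(requests: .reading(file, into: buffer, context: context))

// Later, in completion handling:
// let finished = Unmanaged<InFlightRead>.fromOpaque(
//     UnsafeRawPointer(bitPattern: UInt(completion.context))!
// ).takeRetainedValue()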

I'm interpreting the underlying semantics here as "the ring owns the buffers/fds" (it is of course a little fuzzier than that in practice because C doesn't model ownership), so I think the approaches that would work are:

  • you can ask the ring to borrow a particular buffer/fd directly ("the ring is the collection")
  • the ring provides a "view" into the resources it owns (what I've done here)
  • the ring takes ownership of the resource but the user is supposed to keep track of things themselves

Sequence likely wouldn't be sufficient because typically you're going to be receiving completions out of order due to concurrent execution, so there's no obvious "the next buffer I'll need" (plus multiple completions may refer to the same resource, e.g. a series of operations on the same registered fd).

2 Likes

Many, perhaps most, of the "unusual" API choices here are intended to avoid heap allocations. For example, the callback-based bulk completion wait API is because I don't want to allocate something to return the completions in. We've seen with AsyncBytes that going from O(# of IO operations) allocations to O(1) allocations is a dramatic speedup in practice.

I think it needs to be a little more overloaded than this. For example, a typical non-pointer-y thing to stash in the context is the index to use in the registered files or buffers views.
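Something like this, for instance (hypothetical usage of the proposed API):

// Hypothetical: use the index of a registered buffer as the context.
let buffers = ring.registeredBuffers
ring.prepare(requests: .reading(file, into: buffers[2], context: 2))
// On completion, completion.context == 2 identifies which registered buffer was filled.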

1 Like

This is great stuff. I still think we should save higher level waiting interfaces in general for a followup proposal, but if we can figure out how to handle cancellation better that would be a great improvement. I'll try to read over all this next week and see what I can adapt.

3 Likes

IORingSwift has been around for a couple of years now ;) We're using it in our embedded audio product for UART, SPI and I2C as well as socket and file I/O.

The public API provides structured concurrency wrappers around common operations such as reading and writing. Multishot APIs, such as accept(2), which can return multiple completions over time, return an AnyAsyncSequence. Internally, wrappers allocate a concrete instance of Submission<T>, representing an initialized Submission Queue Entry (SQE), which is then submitted to the io_uring. Completions are handled by having libdispatch monitor an eventfd(2) representing available completions. The user_data in each queue entry is a block, which executes the onCompletion(cqe:) method of the Submission<T> instance in the ring's isolated context.

let clients: AnyAsyncSequence<Socket> = try await socket.accept()
for try await client in clients {
    Task {
        repeat {
            let data = try await client.receive(count: bufferSize)
            try await client.send(data)
        } while true
    }
}

Plenty of other examples here. Original discussion here.

Things I would do differently? The use of classes does cause a lot of allocations (although this hasn't been a bottleneck for us). And, SE-0461 will make it easier to avoid the global actor.

5 Likes

Out of curiosity, how have you found the performance of this approach? I wrote a little sample app that used it and was not entirely happy with it, although sadly perf was broken on my particular Linux install so I didn't have much luck digging into it.

My unsubstantiated guess was that I was seeing overhead from too many syscalls on the eventfd, in which case exposing IOSQE_CQE_SKIP_SUCCESS could reduce the overhead.

The biggest bottleneck exposed by perf was Codable, but now that we've eliminated that from the hot paths, there are indeed quite a lot of context switches. It's under investigation.

2 Likes

SQPOLL support would also be nice in this regard.