RandomAccessCollection without O(1) efficiency

The documentation for RandomAccessCollection states:

in order to meet the complexity guarantees of a random-access collection,
either the index for your custom type must conform to the Strideable protocol
or you must implement the index(_:offsetBy:) and distance(from:to:) methods
with O(1) efficiency.

My collection of N elements is split into pages of n elements. To create an index I might have to load pages and/or walk the pages from the beginning (i.e. a tree or linked-list data structure stored on disk).

So I have a collection that can't exactly guarantee O(1) efficiency for these functions:

func formIndex(before: inout Self.Index)  
func formIndex(after: inout Self.Index)
func index(Self.Index, offsetBy: Int) -> Self.Index

because it either requires disk I/O or walking the collection, depending on the advancement/distance.

But it can do better than stepping over every index. I can skip entire pages because I can easily retrieve the number of elements on a page.
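For example, offsetting an index could hop over whole pages rather than step element by element. A rough sketch of the idea (PagedList and its members are made up for illustration; in reality the page counts live on disk):

struct PagedList {
   var pageCounts: [Int]  // number of elements per page

   struct Position { var page: Int; var offset: Int }

   // Hops over whole pages: O(n / pageSize) page reads instead of O(n)
   // element steps. Better than a naive walk, but still not O(1).
   // (Forward offsets only, no bounds handling in this sketch.)
   func position(_ p: Position, offsetBy distance: Int) -> Position {
      var page = p.page
      var remaining = p.offset + distance
      while remaining >= pageCounts[page] {
         remaining -= pageCounts[page]  // each hop may be a disk read in reality
         page += 1
      }
      return Position(page: page, offset: remaining)
   }
}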

However, O(1) efficiency could be important for sorting, for finding elements in a sorted collection, or for whatever fancy algorithm expects it.

So I am debating whether I should conform to RandomAccessCollection or not.
The question I have is whether I will run into unnecessary trouble with the standard library by lying to it.
Maybe it will then choose the wrong algorithm for something?

So I had a look at the stdlib, but it seems to me that most of the time formIndex(after:) is used, or it just iterates over the entire sequence.

Is this more a documentation issue, i.e. a promise to the user/myself to reach some level of efficiency?

From Collection:

Types that are not able to guarantee this performance must document
the departure, because many collection operations depend on O(1)
subscripting performance for their own performance guarantees.

I tried searching for a similar question or related comment but didn't find anything.

The beginning of this quote is important:

The RandomAccessCollection protocol adds further constraints on the associated Indices and SubSequence types, but otherwise imposes no additional requirements over the BidirectionalCollection protocol.

RandomAccessCollection is not a protocol which adds further function/type requirements to the underlying type, but is a semantic marker protocol that intends to document "this type can give you O(1) access to random elements". Its primary usage is as a generic constraint for algorithms that depend on that type of access pattern to be efficient.

This is the crux: nothing will go overtly wrong if you implement RandomAccessCollection on a type which doesn't offer random access to elements (and nothing could even check whether you're not violating your promises); but indeed, algorithms which expect O(1) efficiency of a certain operation to meet their own algorithmic complexity guarantees can no longer do so. An O(n) loop performing an operation assuming that underlying access is O(1) can easily become "accidentally quadratic", or O(n^2), if it turns out the underlying operation actually also takes O(n) time.
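To make that concrete, here's a contrived sketch (none of this is stdlib code) of a loop that looks linear but isn't if index(_:offsetBy:) secretly walks the collection:

// Looks like O(n): one pass, one "O(1)" index computation per step.
// If index(_:offsetBy:) is actually O(n), this silently becomes O(n^2).
func everyOther<C: RandomAccessCollection>(_ c: C) -> [C.Element] {
   var result: [C.Element] = []
   var offset = 0
   while offset < c.count {
      result.append(c[c.index(c.startIndex, offsetBy: offset)])
      offset += 2
   }
   return result
}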

These types of violations can be really tough to spot. For small datasets of size n, there's not much difference between O(n) and O(n^2) — and very often, inefficient algorithms slip into production, at which point you may discover that for "real world" values of n, everything slows to an absolute crawl.

I would lean toward not adopting RandomAccessCollection on such a type. There's no reason to lie about the capabilities of a type, and if it can't offer O(1) traversal of indices, that's okay. There are probably plenty of operations that you can expose on the type that do operate in O(1) time, but those operations are almost certainly exposed via the Collection protocol itself.


I'll add that for Collection, a large protocol with many requirements, it's tough to find a balance between abstracting over many different types, and offering specific performance guarantees. There's something to be said for documenting "this should be O(1) access, but it may not be for specific types"; and for those specific types, it might be something you want to handle on a case-by-case basis.

But for a protocol whose nature is guaranteeing specific algorithmic complexity, I'd tend to stick to those requirements pretty diligently.


I wouldn't consider disk I/O as part of algorithmic complexity, but as constant overhead. Walking the collection definitely is part of it, however.


Ok. Thank you for the clarification.
I will not adopt the protocol.

Some thoughts, if anyone is interested.

Storing the number of elements of a list or table within the metadata implies that the meta page gets updated with every insert/remove. That would conflict with concurrent transactions, since each transaction would hit the meta page. So don't store the count.

The meta page stores only the head and tail pages for a list, or a root page for a BTree. Note that the leaves of the BTree form a doubly linked list. Since I de facto have a list, I might as well add support for a list type in itself.

Now how about those Swift collection protocols:

Sequence should not be a problem: just return nil from next() when there are no more elements.
Unfortunately the next() function doesn't throw. :frowning: No nice for-loop sugar.
Returning nil is fine if the end is reached, but if a read error occurs in between, the caller might assume the database file is still fine rather than potentially (partially) compromised.
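What I have in mind is something like this sketch, where readNext is a hypothetical stand-in for the page-walking disk read:

struct PagedIterator<Element>: IteratorProtocol {
   let readNext: () throws -> Element?  // hypothetical page-walking read
   private(set) var failure: Error?     // next() can't throw, so stash the error here

   mutating func next() -> Element? {
      do { return try readNext() }
      catch {
         failure = error  // the caller has to check this after the loop
         return nil       // indistinguishable from "end reached" in a for-loop
      }
   }
}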

How about a (forward) Collection?

The startIndex is known since the head is known:

struct Index 
{
   let total: Int     // index of element in entire collection
   let page:  UInt32  // index of page
   let local: UInt16  // index of element on page
}

The endIndex is encountered after walking the entire collection.

Collection expects O(1) for its startIndex :white_check_mark:
Collection does not require O(1) for count :white_check_mark:
Collection expects O(1) for its endIndex :negative_squared_cross_mark:
startIndex + count would after all be O(n).

Can this be improved?
The count can be slightly improved by walking the pages: each page can store < UInt16.max elements, so count is reached in O(n_pages). But that's not good enough, is it?

The endIndex is 1 greater than the last valid index. So add a marker:

struct Index 
{
   let total: Int     // index of element in entire collection
   let page:  UInt32  // index of page
   let local: UInt16  // index of element on page
   let end:   Bool    // marks the end of the collection
}

Collection expects O(1) for its endIndex :white_check_mark:
Moreover, this means the index can conform to Comparable :white_check_mark: (see the sketch below)

Collection expects `index(after:)` in O(1). There is some overhead, but Big-O wise it's O(1) :white_check_mark: (overhead shouldn't count; see the comment by @michelf above)
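For illustration, the Comparable conformance only needs total plus the end marker. A minimal sketch against the Index above:

extension Index: Comparable {
   static func == (a: Index, b: Index) -> Bool {
      a.end == b.end && a.total == b.total
   }
   static func < (a: Index, b: Index) -> Bool {
      switch (a.end, b.end) {
      case (false, false): return a.total < b.total  // two real positions
      case (false, true):  return true               // every real position precedes the end
      default:             return false              // the end precedes nothing
      }
   }
}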

A list that can't be edited is pretty useless.
However, inserting elements can invalidate an existing index when:

  • the inserted element has a lower total index
  • the element is inserted into the same page and that page is full.
    The full page is replaced by two new pages over which the new and existing elements are distributed. Either way, page and local are now invalid.

That's problematic. How can I insert an element at index i and then expect to look up the element with that index if the underlying page has been moved?

Searching the forums, I saw a comment that Array with its Int index can be misleading in how the Collection protocols can be used. Yes it can be :stuck_out_tongue:
I already had a look at String.Index, but a table is more akin to Dictionary. Peeking at Dictionary.Index, I see age and HashTable.Offset. I'm not sure about the latter, but the meaning of the first one is easy to infer.

I kept a dictionary index around, added a lot of elements (which led to a new age), and promptly got a runtime error when I tried reusing the index.
Likewise, the String docs state that indices should only be used with the string that created them.

I shouldn't store indices long-term, and thus nobody should expect me to support that either.
Conclusion: indices are ephemeral. :slight_smile:

How does that help me? I add an age to my index and list. Each time the list is updated, the age is updated. An index with a different age is invalid.

struct Index 
{
   let total: Int     // index of element in entire collection
   let page:  UInt32  // index of page
   let local: UInt16  // index of element on page
   let end:   Bool    // marks the end of the collection
   let age:   Int     // the list's age; bumped on every update
}

Though should it be completely invalidated?
As with an array, I still want to use the same index I used for insertion to look up the element. But unlike an array, I can't be certain the page is still valid.

Thus the total is still viewed as valid information. In effect this creates a fast and a slow path when fetching an element:

func element(at index: Index) throws -> Element
{
   try index.age == self.age
      ? element(at: Int(index.local), onPage: index.page)
      : element(at: index.total)
}

// slow path: look up the nth element by walking the pages
func element(at index: Int) throws -> Element { ... }

// fast private path: look up the nth element on the given page
private func element(at index: Int, onPage page: UInt32) throws -> Element { ... }

However if I previously cached an index with a greater total, that element would have shifted and I would get the preceding element back. But that's the same for Array. So: :white_check_mark:

Collection expects O(1) access to the element :white_check_mark: (within the same age)
Also, index(after:) is still fine because it's just total + 1.

Unfortunately, given the slow path, subscript is now :warning:

How about BidirectionalCollection?
I have an endIndex, but I can't create the index before it with O(1) complexity.
However, I really want this: when I want the last n elements, I don't want to walk the entire collection to get there.

I'll add another marker to indicate where the index started from:

struct Index 
{
   let total:  Int    // index of element in entire collection
   let page:   UInt32 // index of page
   let local:  UInt16 // index of element on page
   let end:    Bool   // marks the end of the collection
   let age:    Int    // the list's age; bumped on every update
   let anchor: Anchor // enum: head, tail; marks whether total counts from the head or the tail
}

I can now start at either end and walk the collection.
So index(before:) :white_check_mark:

There is still a problem though.

With the exception of the start and end indices, I can't order two indices that originated from opposite ends unless I know the total count. Comparing such indices is undefined behaviour.
I can either throw an error or (because it's not really an issue with the file) fail with a runtime error.
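Roughly like this, as a sketch that ignores the start/end special cases (isOrderedBefore is a made-up name):

func isOrderedBefore(_ a: Index, _ b: Index) -> Bool {
   switch (a.anchor, b.anchor) {
   case (.head, .head): return a.total < b.total
   case (.tail, .tail): return a.total > b.total  // total counts from the back
   default:
      // Ordering mixed anchors would need the full count, which isn't O(1).
      fatalError("can't compare indices anchored at opposite ends")
   }
}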

How about RandomAccessCollection?
It requires index(_:offsetBy:) and distance(from:to:) with O(1) complexity, and that's just not possible (as explained by @itaiferber above). And that's okay.

Summary:
Sequence: :white_check_mark:
Collection: :warning:
BidirectionalCollection: :warning: :warning:
RandomAccessCollection: :negative_squared_cross_mark:

In the end it is not advisable to conform to the Collection protocols, nor to expect O(1) efficiency from this type.
However, I do think that with this index type I can walk the collection from either end, find indices and elements that match some predicate, and look them up again within reasonable time.

It's only by trying to conform to the Collection types that I am considering/encountering all of these aspects. :+1:

One more thing: about that Index type.
String.Index is just an Int with a bunch of bit masks. I assume that's more efficient because it fits in a register?

The max element number needs 48 bits (yes, UInt48 doesn't exist). The markers require only a bit each, while a Bool takes an entire byte.

So I am heading toward this:

enum Anchor { case head, tail }

struct Index 
{
   private let a: UInt64  // total (48 bits) | end (1) | anchor (1) | age (14)
   private let b: UInt64  // page (32 bits) | local (16) | count (16)

   var total:  Int    { Int(a & 0x0000_FFFF_FFFF_FFFF) }      // 48 bits of a
   var end:    Bool   { a & (1 << 48) != 0 }                  // 1 bit of a
   var anchor: Anchor { a & (1 << 49) == 0 ? .head : .tail }  // 1 bit of a
   var age:    Int    { Int(a >> 50) }                        // 14 bits of a

   var page:   UInt32 { UInt32(truncatingIfNeeded: b) }       // 32 bits of b
   var local:  UInt16 { UInt16(truncatingIfNeeded: b >> 32) } // 16 bits of b
   var count:  UInt16 { UInt16(truncatingIfNeeded: b >> 48) } // 16 bits of b: number of elements on local page
}
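And, for completeness, a hypothetical packing initializer matching that layout (assuming it lives in the same file, so the private fields are visible):

extension Index {
   // Hypothetical packing initializer; masks match the layout above.
   init(total: Int, end: Bool, anchor: Anchor, age: Int,
        page: UInt32, local: UInt16, count: UInt16)
   {
      a = (UInt64(total) & 0x0000_FFFF_FFFF_FFFF) |
          (end ? 1 << 48 : 0) |
          (anchor == .tail ? 1 << 49 : 0) |
          (UInt64(age) << 50)
      b = UInt64(page) | (UInt64(local) << 32) | (UInt64(count) << 48)
   }
}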

Inserting UInt16.max elements while the age only goes up to 14 bits could be an issue.
Once the age overflows 14 bits, it's always the slow path. But the collection type should not escape the transaction anyway, and adding more than 2^14 elements in one transaction is going to be slow anyway. At that point, append instead of insert.

But I'll probably change my mind a few more times before this settles down.
Thanks for reading. Typing it out helped me.

PS. It's also interesting to go through Foundation. I never expected to see FIXMEs and TODOs in code like that. Kinda nice/reassuring to know.