Subscripting Array with short indices

(a few days ago there was some lively discussion about relaxing type requirements for array indices, a thread the forums seem to have lost, so this is my attempt to restart that discussion.)

it’s well known that swift is opinionated about standardizing on Int as an index type, which is why we cannot directly subscript Array with UInt, Int32, UInt8, etc. i don’t really disagree with this idea in general; i’ve always hated how C brackets all the integer types.

but there is one downside to Int, compared to, say, Int32, in that Int has a huge stride: 8 bytes!

obviously, there are a lot of situations where this is not a problem at all:

  • Int uses more memory than Int32, but it’s highly uncommon for Int itself to be your memory bottleneck, so using Int fields for in-memory structures is usually okay.

  • Int can use a lot more storage than Int32 when serialized to a text-based format like JSON, but it almost never does, because JSON is a variable-length encoding:

    {"_id":1}
    

    uses the same amount of storage regardless of original bit width. so serializing Int to JSON is usually okay.

so, Int is good for in-memory stuff, and Int is good for JSON-encoded stuff.
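
as a quick sanity check of the JSON claim above, here’s a toy example using Foundation’s JSONEncoder (the WideID and NarrowID types are made up just for illustration):

import Foundation

// JSON doesn’t care about the declared bit width: an Int field and an
// Int32 field holding the same value serialize to exactly the same bytes.
struct WideID:Codable   { let _id:Int }
struct NarrowID:Codable { let _id:Int32 }

let encoder:JSONEncoder = .init()
let wide:Data = try! encoder.encode(WideID.init(_id: 1))
let narrow:Data = try! encoder.encode(NarrowID.init(_id: 1))

print(String.init(decoding: wide, as: Unicode.UTF8.self))   // {"_id":1}
print(String.init(decoding: narrow, as: Unicode.UTF8.self)) // {"_id":1}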

but… Int is not good for binary-serialized stuff, like BSON, because BSON is a fixed-length encoding, and always uses 8 bytes to encode Int, which is twice as much storage as it uses to encode Int32.

i suppose one thing we can do is to use Int exclusively from the swift side, and narrow the Ints to Int32 when encoding to BSON. but now we’ve lost the compile-time guarantee of not trapping/throwing on encode, and (unlike decoders) encoders are not supposed to trap, or really, fail at all.
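
to make that failure mode concrete, the narrowing boils down to something like this (the value of id is hypothetical):

// narrowing an Int to Int32 can fail at run time, so an encoder that does
// this has to either trap or throw, which encoders aren’t supposed to do.
let id:Int = 5_000_000_000
guard let narrowed:Int32 = .init(exactly: id)
else
{
    fatalError("value \(id) does not fit in an int32 field")
}
print(narrowed)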

another approach could be to implement a more intelligent encoding + decoding strategy for these Int-typed fields, that encodes Int32 if it fits, and Int64 otherwise. but this adds a lot of complexity to the encoding + decoding logic just to serialize a simple Int field. MongoDB is totally cool with this (it brackets Int64 and Int32 when executing queries), but it’s a nightmare to handle in the strongly-typed swift world, because you have taken something that was a scalar type in your database schema, and turned it into a sum type with two cases.
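
in swift terms, the field stops being a plain integer and becomes something like this (the name is made up):

// the “adaptive” encoding strategy turns a scalar field into a sum type
// with two cases, and every consumer of the schema now has to unwrap it.
enum AdaptiveInteger
{
    case int32(Int32)
    case int64(Int64)
}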

i could also use a property wrapper to make the Int32 “look like an Int”, but this presents some challenges for algorithms that mutate the field. and if we really think about it, making the Int32 behave like an Int is a non-goal; the only reason we are pretending it is an Int is to be able to subscript an Array with it.
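
for reference, the kind of wrapper i mean looks roughly like this; it’s only a sketch:

// a property wrapper that stores an Int32 but presents an Int, purely so
// the field can be used to subscript an Array without a manual conversion.
@propertyWrapper
struct WidenedIndex
{
    var storage:Int32

    init(wrappedValue:Int)
    {
        self.storage = Int32.init(wrappedValue) // traps if the value overflows
    }

    var wrappedValue:Int
    {
        get { Int.init(self.storage) }
        set { self.storage = Int32.init(newValue) } // same trapping hazard on mutation
    }
}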

what are we to do about Int?

4 Likes

I'll just point out that not all binary formats encode integers using a fixed size. MessagePack doesn't really care about the size: a value is encoded using only the bytes it needs and nothing more. Protocol buffers use variable-sized encoding too when the schema contains an int32 or int64, unless you specifically request a fixed32 or fixed64 type. BSON is optimized for other use cases, I suppose, since it doesn't do that.

That said, I was working with protocol buffers (swift-protobuf) recently and it turns out they never use Int anywhere. Depending on the schema, you'll work with one of Int32 or Int64 (or their unsigned counterparts). In a way that's good because you'll never fall into a dead end (trap) on a 32-bit platform when reading a value too big, so I'm not complaining. I adapted most of the related application logic to use either Int32 or Int64 internally so there's almost no conversion needed. And I don't decode any array index anywhere in this code base so it helps too. It's possible working on an app rather than a library gives me some liberty here.

In my case, I do have a bound-checked overload I use for any index from user input. And this makes me wonder: are you validating your indices coming from the database records or are you allowed to trap if something is out of bounds? Maybe you could introduce a labeled bound-checking subscript for Int32 (and perhaps others) that can handle those errors more gracefully and use the type mismatch to your advantage as a hint that you need to deal correctly with out of bounds errors here.

2 Likes

so, the workaround i’m currently using is to define wrapper types around Array, and these wrappers often do have a bounds-checked optional subscript.

@frozen public
struct Shoes
{
    public
    var elements:[Shoe]

    @inlinable public
    subscript(index:ShoeIndex) -> Shoe?
    {
        ...
    }
}

but the problem is this Shoes wrapper really isn’t justified in its existence; for all intents and purposes, it is just a [Shoe] array with an extra subscript. and that leads to some strange usage patterns. you subscript the Shoes instance directly:

if  let shoe:Shoe = shoes[index]
{
}

but you append to its elements property:

shoes.elements.append(contentsOf: moreShoes.elements)

you iterate the elements property:

for ... in shoes.elements
{
}

et cetera, et cetera.

now with enough effort, you can hide the elements buffer entirely by duplicating enough of Array’s API and (RandomAccess)Collection conformances, but at that point your Shoes type has grown far larger than is proportional to the amount of value it is actually providing. because all it really is is an array with an extension constrained where Element == Shoe, were it not for the fact that that is a terrible place to add API.

It's perhaps easier than you think to implement RandomAccessCollection because most of the requirements have a default implementation. Here's how I'd do it:

struct List32<Object>: RandomAccessCollection, MutableCollection {
	var objects: [Object]

	public subscript(position: Int32) -> Object {
		get { objects[Int(position)] }
		set { objects[Int(position)] = newValue }
	}
	public var startIndex: Int32 { Int32(objects.startIndex) }
	public var endIndex: Int32 { Int32(objects.endIndex) }
}

It's still missing some Array-specific functions, but if this list type is generic you only have to do it once and can reuse it for different element types.
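
Usage then looks like ordinary Array usage, except the index is an Int32 (Shoe here is just a stand-in type):

struct Shoe { var size: Int }

var shoes = List32(objects: [Shoe(size: 42), Shoe(size: 43)])
let index: Int32 = 1
shoes[index].size = 44 // subscript directly with an Int32
for shoe in shoes { print(shoe.size) }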

I'm not too sure if the subscript returning an optional fits well with the RandomAccessCollection protocol though, because that would make the element type optional. Perhaps it should be a supplementary named subscript, such as list[boundChecking: myIndex], like the one I use myself:

extension RandomAccessCollection {

	/// Return element at `index` or `nil` if `index` is out of bounds.
	///
	/// Unlike the normal subscript, this one will check the provided index is within bounds and return
	/// `nil` if not. This should be used to provide a fallback path when an out of bounds `index`
	/// is expected and requires special handling.
	/// ```
	/// // use a default value when out of bounds:
	/// let value = data[boundChecking: index] ?? 0
	/// ```
	public subscript (boundChecking index: Index) -> Element? {
		indices.contains(index) ? self[index] : nil
	}

}

If you fear adding an extension to a standard library protocol might cause clashes, you can restrict it to your own list type too.
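
For instance, constrained to the List32 type from the earlier example (untested sketch, but it should work the same way):

extension List32 {
	/// Bounds-checked lookup, available only on List32.
	public subscript(boundChecking index: Int32) -> Object? {
		indices.contains(index) ? self[index] : nil
	}
}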

2 Likes

yeah, so the reason i don't like unifying the Array wrappers into a generic type is that a lot of API only exists for a particular element type. for example, there could be a lot of API for List32<Shoe> and a lot of completely unrelated API for List32<Bag>, etc etc.

this is not an implementation problem, this is an organization and documentation problem. because List32<Shoe>, List32<Bag>, etc. aren't real types that can be documented or curated; they are just different flavors of the same generic List32 type.

this is also essentially the only reason why i dislike adding API to Array<Shoe> itself, because such API cannot be curated or documented separately from everything else living in Array's namespace.

if Array<Shoe> were a "real" curateable type, there wouldn't be much point in going through all the List32<Shoe> ceremony in the first place.

I suppose if you want a different type that can be documented separately from Array, you should use a protocol to make things generic:

protocol List32: RandomAccessCollection, MutableCollection {
	associatedtype Object
	var objects: [Object] { get set }
}

extension List32 {
	public subscript(position: Int32) -> Object {
		get { objects[Int(position)] }
		set { objects[Int(position)] = newValue }
	}
	public var startIndex: Int32 { Int32(objects.startIndex) }
	public var endIndex: Int32 { Int32(objects.endIndex) }
	// some other Array-like APIs like insert and remove, operating on `objects`
}

// "Real" type for a list of shoes:
struct Shoes: List32 {
	var objects: [Shoe]
	// all the methods of List32 are included in Shoes
	// some other Shoe-related APIs can be added
}

there are two problems with this; the first (and lesser) problem is that objects needs to be public, so we have two parallel and largely redundant sets of API on self and self.objects.

the second, and in my mind much greater, problem is that every access through List32.subscript(_:) is O(n), because it triggers copy-on-write; see: Performance implications of declaring Array/Dictionary property requirements in a protocol - #2 by Ben_Cohen

what is needed is a way to declare a { get _modify } accessor requirement in the protocol, but there is no way to do that yet.

perhaps sometimes the optimizer can cut through this, but i don’t know of a good workflow for investigating whether it can in a particular situation, because you can’t godbolt multi-module setups. (which is my usual go-to strategy for checking if the compiler can optimize something.)

I suppose this would work, but there's a little more boilerplate to fill for each "real" type:

protocol List32: RandomAccessCollection, MutableCollection {
	associatedtype Object
	var objects: [Object] { get }
	mutating func modifyObjects(_ task: (inout [Object]) -> ())
}

extension List32 {
	public subscript(position: Int32) -> Object {
		get { objects[Int(position)] }
		set {
			modifyObjects { objects in
				objects[Int(position)] = newValue
			}
		}
	}
	public var startIndex: Int32 { Int32(objects.startIndex) }
	public var endIndex: Int32 { Int32(objects.endIndex) }
	// some other Array-like APIs like insert and remove.
}

struct Shoe {}
struct ShoeList: List32 {
	var objects: [Shoe]
	mutating func modifyObjects(_ task: (inout [Object]) -> ()) {
		task(&objects)
	}
	// some other Shoe-related APIs
}
1 Like

yeah, that would work (although i really wish we could chain the _modify so we could also perform efficient mutations on the element through the subscript.)

shoes[index].prices.append(50)
//           ^~~~~~ still copies `prices` because it copies the element

i suppose this is just one of those pick your poison situations…

Maybe this would work too:

protocol List32: RandomAccessCollection, MutableCollection {
	associatedtype Object
	var objects: [Object] { get set }
}

extension List32 {
	public subscript(position: Int32) -> Object {
		get { objects[Int(position)] }
		_modify {
			var temp = objects
			objects = [] // wipe temporarily to avoid copy-on-write
			yield &temp[Int(position)]
			objects = temp
		}
	}
	public var startIndex: Int32 { Int32(objects.startIndex) }
	public var endIndex: Int32 { Int32(objects.endIndex) }
	// some other Array-like APIs like insert and remove.
}

struct Shoe {}
struct ShoeList: List32 {
	var objects: [Shoe]
	// some other Shoe-related APIs
}

Note that I haven't verified what the compiler produces for any of this, just that it compiles.

i will say the completely unspecialized machine code does not look great, but i would expect it to look better if 1) the subscript were inlinable and 2) the conformance were declared in the same module as the conforming type.

i really wish there were a way to actually get some visibility into what the compiler would do in this situation.