Choosing between switch over types and visitor pattern

Terje · August 18, 2022, 11:13am

Background: I'm implementing a database in pure Swift from the ground up.

DataCollection is a bag of bytes which stores the underlying bytes of some Swift types like String or Int.
For example a page is a DataCollection and a cell is DataCollection spread over out over multiple pages.

Likewise some "Primitive" types like String or Int can be written to a table row. Eventually those bytes end up in a cell. A row has a header for housekeeping which tracks the size of a string or whether it's a nil and so on.

More "complex" values are converted to a "primitive" type before they are stored.
Date -> Double -> UInt64 as in Date -> #seconds -> endianness-respecting bytes

I have a Writer and a Reader protocol for rows.
Reading can be done on demand because reading a boolean should not require reading a wall of text as well.
Writing is done in its entirety because the columns are not fixed width across writes.
A value can end up in multiple rows (the table, a view on that table, index tables,..)
Therefore I want to encode a column value once and write it multiple times.
For other reasons too a value is encoded into a header and payload. So I have an array of elements:

struct Element
{
	let header: Encodement //encodes info about the stored value
	let payload: Primitive?
}

At some point I actually have to write the primitive to the collection.
I have three options: switching over the type or a visitor pattern.

First option:

protocol Primitive {}
switch element.payload 
{
  case let value as String: try collection.write(raw: value) //raw: don't include string length
  case let value as UInt8: try collection.write(value: value)
  ...
}

Second option:

protocol Primitive
{
	func write(to collection: any DataCollection)
}

extension FixedWidthInteger where Self : Primitive
{
	func write(to collection: any DataCollection)
	{
		try collection.write(value: self)
	}
}

extension UInt8:  Primitive {}
extension UInt16: Primitive {}
extension UInt32: Primitive {}
extension UInt64: Primitive {}

A third option is sticking the payload in an enum.

enum Payload : Equatable
{
	case none
	case blob(Data)
	case text(String)
	case numeric(any FixedWidthInteger)
    ...
}

Benefit of this is I can't forget some types. The compiler will complain.
But it uses a little more memory (for the enum).

In number of code lines I doubt one or the other makes any difference. It's either (casting in) a few switch statements of implementing a protocol for a few supported types.
In the end any of these can be made to work.

So I am not really asking what I should pick or how to make it work. I am asking though how you decide on any of these options. That would be more instructive.

How, not what, do you pick?
Does this fall under premature optimisation and does it just boil down to a matter of style?
Is one approach "better" or more Swifty than the other?

PS. Overall a swift struct goes through a custom encoder which generates an element dictionary which is then sorted according to a schema, etc.

vitamin · September 1, 2022, 11:04pm

The only un-Swifty thing I see here is the enum case called .none in third option's enum. In a code review, I would usually flag that as a code-smell because it's cleaner to use an optional variable at the next level up, but that's just my opinion.

Aside from that, I don't know if there's a "Swiftiest" option here. But given these options, I'd prefer a modified form of the second option (visitor pattern) for several reasons:

Using a protocol to describe functionality has lots of advantages compared to using a protocol to define variables of a data type. For example, it makes it simple to do type erasure and mocking because the protocol requirements don't have generic parameters or variables.
It makes it possible for any consumer of your library to extend the functionality by adding new conformances to the protocol without having to modify the core code of the library itself. (You lose this benefit if your core code relies on a switch or enum.)

Of course, if you want to limit behavior to some compile-time-specified set of options, then an enum or switch might be called for. The problem with switching over types with case let as though, is that you're always going to have a default: case to worry about, and you don't run into that with the second option.

For the second option though, I would have the following suggestions:

Name the protocol Writable instead of Primitive. (Try to name your protocols according to what one job the protocol is describing the capability to do. According to Single Responsibility Principle, a protocol should stay focused. Then you can use protocol compositions later if you need to describe the ability to do more than one job.)
Take advantage of generics, e.g.:

protocol Writable
{
    /// Writes a representation of `self` into immutable store `S`,
    /// returning the updated store after the value was successfully written.
    func writing<S: Storing>(to store: S) async throws -> S
}

extension Float: Writable
{
    func writing<S: Storing>(to store: S) async throws -> S {
        try await store.write(self)
    }
}

Terje · September 3, 2022, 10:56am

Thank you for your feedback Jon.

Try to name your protocols according to what one job the protocol is describing the capability to do.

The naming is probably one of the reason why I disliked my code and kept going back and forth between the three options mentioned. Thank you for making me realise that. Writable is too generic/vague but RowWritable might do. I also like the generic version of it.

So how did I get to that name Primitive?

The file being accessed by my module could be seen in a few different ways:

As a collection of pages (bag of bytes). The layout is up to the user. Only paging is done by me (the module).
Some layout is given to the pages. One page type is a collection of cells. Cells can overflow onto other pages. Reading/Writing to a cell is done via a `CellConvertible' protocol.
As a B-Tree. The cells are now split into key-value pairs. Node and leaf pages take care of finding those pairs. The key is comparable and CellConvertible', the value is CellConvertible'.
As a database. A schema is (to be) stored with the file. The cell value is an ordered collection of columns.
As a graph with on-demand fetching of relationships. A graph is isolated to an actor. Backed by a database which is an actor as well.

The Primitive protocol appears during item 4.
How can content end up in the database? Provide an ordered array of column values, a dictionary of column values or a type conforming to Codable. The last two use the schema to sort the values by column name/index.

Anything can be written into the file using a custom page or CellConvertible. But if it's interpreted as a row in a database (i.e. Item 4) I can view it as a set of columns. Using this interpretation I can then optimise read and write access. For example a row is preceded by a header. Like column 5 in this row is 100 bytes of text and column 6 is nil. I know the location of column 5 and can skip that if I want to partially read the row.

The header value of each column is backed by enum. The cases can have an associated value representing the byte length of the underlying value.

For example: a date can be converted to a double (could be done by the user or my custom Encoder type). This double can be viewed as UInt64 bit pattern (conversion done by the user or my encoder). The UInt64 is the most basic type (that I can think of), so I have to support that if I want to represent the file as database with some schema. (Note: arbitrary bytes of known size can be stored. So an image can be converted to a bag of bytes by you and stored within the file if need be.)

The header for a UInt64 is numeric(8). A string "foo" would be text(3). I choose an enum to ensure unique values. These cases are converted to an Int representing the type + payload length. Modulo calculus is used because strings and blobs (binary data) are of arbitrary length. This kinda prevents custom user types but enum's aren't extendible anyway. Because of this enum header I had a corresponding enum for the payload as one of the options mentioned.

if you want to limit behavior to some compile-time-specified set of options

Given the wall of text above, I think I have to limit this. I need to encode some type information into the header and I need to know how to write its bytes into the cell. Any type that can converted into a primitive, hence the protocol name, is supported. But to support every possible type as a primitive is not feasible or necessary I think.

I don't think there is any practical limit as to what can be stored this way. Provided the user conforms to Codable and I support the correct set of primitives. I have FixedWidthInteger, Strings, Data (if that ever becomes Sendable), bool.

The problem with switching over types with case let as though, is that you're always going to have a default: case to worry about,

Is that really that much of an issue? I can just throw an unsupported type error.
Not I am a huge fan of this version. It seems like I threw out too much type information too soon and I have to recover it here.

Since posting I have re-read the notes about Swift performance. I think static dispatch is used for the protocol. That might be inlineable? That would be nice. Switching over types is dynamic, no? So slower I guess?

To everyone that made it this far: thank you for reading.

gwendal.roue · September 3, 2022, 3:59pm

TL;DR

Public protocols can't have internal customization points, and it's a pity.
The visitor pattern does not always scale well.
Enums rock.

During the many years of development of GRDB the SQLite library, I had to support many kinds of "expressions" (values, columns, compound expressions such as Column("a") < Column("b"), and many others).

I went through several refactorings, and settled on:

A public but opaque type that wraps a private enum, with internal factory initializers and instance methods.
A public protocol that can build the opaque type from any other type (user types, GRDB types, as well as standard types when relevant).

// Public opaque type
public struct SQLExpression {
    // Private implementation
    private var impl: Impl
    private enum Impl {
        case value(...)
        case column(...)
        ...
    }

    // Internal factory methods
    internal static func value(...) -> Self {
        self.init(impl: .value(...))
    }

    internal static func column(...) -> Self {
        self.init(impl: .column(...))
    }

    ...
}

extension SQLExpression {
    // A lot of internal instance methods
    internal func frobnicate(_ context: FrobnicationContext) -> Frobnication {
        switch impl { ... }
    }
}

// Public protocol that produces SQLExpression
public protocol SQLExpressible {
    var sqlExpression: SQLExpression { get }
}

// Public expressible types
public struct Column: SQLExpressible {
    var name: String
    var sqlExpression: SQLExpression { .column(name) }
}

extension Int: SQLExpressible {
    var sqlExpression: SQLExpression { .value(self) }
}

The role of the factory initializers is mainly to prevent the creation of invalid values of the private enum. Some cases have invariants that can't be expressed by the enum type itself (such as a case whose payload is an array that must not be empty, for example).

That's a lot of boilerplate, but I'm pretty happy with it.

The pollution of public apis is minimal (an sqlExpression property that does not bother anybody). All other apis have the desired visibility (public, internal, private).

What did I try before? A lot of painful solutions

Why not a public protocol with one type per case?

For two reasons:

If you don't want to cast at runtime (case let value as Column, etc.), with the risk of forgetting a case, then you have to rely on customization points.

Public protocols can not define internal customization points. So the frobnicate method above has to be public, as well as the FrobnicationContext and Frobnication types.

The generally accepted technique for public apis that should have remained internal is to prefix their names with underscores: _frobnicate, _FrobnicationContext... The problem is that those ugly underscores proliferate, and you end up with a lot of public types for no good reason.

None of those problems happen with the enum-based technique suggested at the beginning: all apis that should remain internal remain internal.
In my particular case, it wanted to compare all the various implementations of _frobnicate across the types that implement it. Due to the file-scoping of private and fileprivate apis, all those implementations were dispatched across many files, and I it was difficult to compare them in a glance.

Again, this problem does not happen with the enum-based technique: all implementations are a big switch on the cases of the private enum, easy to navigate.

Why not the visitor pattern?

Some of the problems of public visibility expressed above still apply, but at least MOST of public/internal methods above remain internal:

public protocol _SQLExpressionVisitor {
    mutating func visit(column: Column)
    mutating func visit(value: ...)
    ...
}

public protocol _SQLExpression {
    func _accept<Visitor: _SQLExpressionVisitor>(_ visitor: inout Visitor)
}

// Public expressible types
public struct Column: _SQLExpression {
    var name: String
    
    public func _accept<Visitor: _SQLExpressionVisitor>(_ visitor: inout Visitor) {
        visitor.accept(column: self)
    }
}

extension Int: _SQLExpression {
    public func _accept<Visitor: _SQLExpressionVisitor>(_ visitor: inout Visitor) {
        visitor.accept(value: self)
    }
}

// Internal apis
extension _SQLExpression {
    internal func frobnicate(_ context: FrobnicationContext) -> Frobnication {
        var visitor = Frobnicator(context: context)
        _accept(&visitor)
        return visitor.result
    }
}
private struct Frobnicator: _SQLExpressionVisitor {
    var context: FrobnicationContext
    mutating func visit(column: Column) { ... }
    mutating func visit(value: ...) { ... }
    ...
}

The visitor pattern is type-safe, there is no dynamic cast.

The amount of underscored apis is reduced.

All implementations for a given method can be located in the same file (the slight cost is that visited types must sometimes expose some private stuff).

So what's the problem with the visitor pattern?

The visitor pattern does not deal very well with type hierarchies. The GRDB SQLite wrapper does not only deal with SQL expressions, but also with selections, ordering terms, and various other members of the SQLite grammar. It happens that in order to deal with selections (such as the "*" in "SELECT * ..."), one must also deal with all expressions ("SELECT name, score + 1 ..."). This creates a hiearchy in the visitor protocols:

public protocol _SQLSelectionVisitor: _SQLExpressionVisitor {
    // All selection-specific visiting methods (plus inherited ones for expressions)
    mutating func visitAllColumns()
    ...    
}

Practically speaking, selection visitors have the same implementation for most expression-visiting methods, with a few exceptions. Repeating the same code again and again smells, so you try to define a convenience default implementation for all expressions:

public protocol _SQLSelectionVisitor: _SQLExpressionVisitor {
    mutating func visitAllColumns()
    ...
    // Convenience for all expressions
    mutating func visit(expression: _SQLExpression)     
}

// Default implementations
extension _SQLSelectionVisitor {
    mutating func visit(column: Column) { visit(expression: column) }
    mutating func visit(value: ...)  { visit(expression: value) }
    ...
}

private struct SomeSelectionVisitor: _SQLSelectionVisitor {
    // Thanks to the convenience, all expressions are visited here.
    mutating func visit(expression: _SQLExpression) { ... }
}

But if one or two expression types require a special processing, the compiler will not tell you, because the catch-all convenience method fulfills the compiler needs. It is thus very easy to forget. Of course, one can rely on tests, but the compiler is just faster to tell that something's wrong than runtime tests: nothing beats static checks.

So in the end I stopped adding the convenience catch-all method, and ended up with awfully long lists of identical or nearly identical methods.

Again, this problem does not happen with the technique of the private enum described at the beginning.

As a conclusion: enums made me happy, perhaps they will do the same to you.

vitamin · September 3, 2022, 4:48pm

That's for you to decide.

As to whether static dispatch is used, or whether it's inlineable, might depend on whether you're using the protocol as a generic constraint vs. as an existential. Hence why I suggested:

protocol Writable
{
    /// Writes a representation of `self` into immutable store `S`,
    /// returning the updated store after the value was successfully written.
    func writing<S: Storing>(to store: S) async throws -> S
}

extension Float: Writable
{
    func writing<S: Storing>(to store: S) async throws -> S {
        try await store.write(self)
    }
}

(Using generics tends to incur static dispatch. Someone correct me if I'm wrong there.)

vitamin · September 3, 2022, 4:53pm

True, but you can make a protocol composition from n public protocols plus n internal or private protocols. Use the internal or private protocols for customization points you don't want to expose to library consumers. No?

What made you decide against using a where clause to avoid duplication?

Terje · September 3, 2022, 7:29pm

Thank you for sharing your experiences.

I like enums too because the compiler reminds me of forgotten cases.
I also don’t like underscores but I do use them in a few places.
Writing the SQL parser and interpreter is the last major part I haven’t written any code for yet. Kinda dreading that at the moment.
Namespaces would be nice to have instead of the enum trick.
Supporting Codable already requires a lot of boiler plate. It seems somewhat unavoidable in this rather specific domain space.