New "Unevaluated" type for Decoder to allow later re-encoding of data with unknown structure

cherrywoods · March 18, 2018, 5:08pm

Currently it is impossible to ask a Decoder for some unevaluated data from its storage. This problem arises if someone needs this data to re-encode it later but does not know how it is structured. The problem was described in SR-5311.

An Unevaluated type has also been discussed in relation to this question.

I am opening this topic because, to me, that discussion no longer seems directly relevant to the original problem of the other thread.

Decode a JSON object of unknown format into a Dictionary with Decodable in Swift 4

[...] It’s possible to extend the containers (and add default implementations) to do this, but I think a slightly more elegant solution would be to just expose the Unevaluated type more generally:
public struct Unevaluated : Codable {
    public let value: Any
    public init(_ value: Any) { self.value = value }

    public init(from decoder: Decoder) throws {
        // throw a type mismatch
    }

    public func encode(to encoder: Encoder) throws {
        // throws an invalid value error
    }
}
The naming and scoping is something we’d need to figure out, but the idea is that this type can be shared among all encoders and decoders (and their containers). Just like some encoders and decoders intercept certain types to customize them (e.g. with encoding strategies), they would intercept Unevaluated as well. You would encode(Unevaluated(...)) and decode(Unevaluated.self) just as you would any other type. If the type is intercepted, the encoder/decoder does what it needs to to handle the inner value; if not, you’ll get the error thrown by default (“this encoder/decoder can’t handle Unevaluated”), which you can catch and decide how to deal with.

Aside from naming and deciding where to hang the type, I think this would be the easiest and least intrusive approach.
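
As a minimal sketch of the fallback path described above: the Unevaluated struct below is hypothetical (it is not part of Foundation) and the error messages are placeholders. Because today's JSONDecoder does not intercept the type, decoding it should end up in init(from:) and surface the .typeMismatch, which the caller can catch.

```swift
import Foundation

// Hypothetical marker type, following the sketch quoted above.
struct Unevaluated: Decodable {
    let value: Any
    init(_ value: Any) { self.value = value }

    // Only reached when the decoder does not intercept Unevaluated.
    init(from decoder: Decoder) throws {
        throw DecodingError.typeMismatch(
            Unevaluated.self,
            DecodingError.Context(
                codingPath: decoder.codingPath,
                debugDescription: "This decoder does not support Unevaluated."))
    }
}

let payload = Data(#"{"anything": [1, 2, 3]}"#.utf8)
var fellBack = false

do {
    _ = try JSONDecoder().decode(Unevaluated.self, from: payload)
} catch DecodingError.typeMismatch(_, let context) {
    // JSONDecoder knows nothing about Unevaluated, so the default error arrives here.
    fellBack = true
    print("Falling back:", context.debugDescription)
} catch {
    print("Unexpected error:", error)
}
```

A decoder that did intercept the type would never call init(from:), so the same call site would simply receive the wrapped value.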

Decode a JSON object of unknown format into a Dictionary with Decodable in Swift 4

Getting back the underlying data from JSONSerialization is the exact idea behind Unevaluated — the difference between AnyCodable and Unevaluated is that AnyCodable attempts to decode types that it knows about on its own and is performing conversions; on the other hand, Unevaluated would be a marker type which asks the Decoder to stick whatever existing representation it has of the underlying data into its .value and returns that. If the Decoder supports Unevaluated (e.g. in formats where it’s possible to grab an underlying representation like NSNulls, Strings, etc.), it can do that; if the Decoder does nothing special to recognize Unevaluated and ends up calling its init(from:), Unevaluated will just throw a .typeMismatch letting you know that the Decoder doesn’t support it.

To make this concrete, the following implementation of unbox is how JSONDecoder handles taking an existing container and coercing it into the value you’ve asked for:
fileprivate func unbox<T : Decodable>(_ value: Any, as type: T.Type) throws -> T? {
    if type == Date.self || type == NSDate.self {
        return try self.unbox(value, as: Date.self) as? T
    } else if type == Data.self || type == NSData.self {
        return try self.unbox(value, as: Data.self) as? T
    } else if type == URL.self || type == NSURL.self {
        guard let urlString = try self.unbox(value, as: String.self) else {
            return nil
        }

        guard let url = URL(string: urlString) else {
            throw DecodingError.dataCorrupted(DecodingError.Context(codingPath: self.codingPath,
                                                                    debugDescription: "Invalid URL string."))
        }

        return url as! T
    } else if type == Decimal.self || type == NSDecimalNumber.self {
        return try self.unbox(value, as: Decimal.self) as? T
    } else {
        self.storage.push(container: value)
        defer { self.storage.popContainer() }
        return try type.init(from: self)
    }
}
The value passed in to the method is the value returned from JSONSerialization (e.g. an NSDictionary containing NSString and NSArray values); today, JSONDecoder knows about a few special types and intercepts them to reinterpret the data. If we were to call unbox(value, as: Unevaluated.self) today, we’d fall into that last case:
self.storage.push(container: value)
defer { self.storage.popContainer() }
return try type.init(from: self)
That last line would end up calling Unevaluated.init(from:), which would just throw a .typeMismatch. In order to support Unevaluated, we’d expand unbox to do this:
fileprivate func unbox<T : Decodable>(_ value: Any, as type: T.Type) throws -> T? {
    if type == Unevaluated.self {
        return Unevaluated(value) as? T
    } else if ... {
        // ...
    } else {
        self.storage.push(container: value)
        defer { self.storage.popContainer() }
        return try type.init(from: self)
    }
}
This would just return an Unevaluated instance whose contents are exactly what JSONSerialization returned: the collection of NS values it decoded. When you get back the Unevaluated type, its .value contains exactly what you’re getting at. (This is the raw value access you’re looking for.)
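
The interception idea can be tried outside of JSONDecoder with a toy version of unbox. This is only a sketch: the free function unbox and the Unevaluated struct here are stand-ins written for illustration, not the private Foundation implementation. It shows that the wrapped value is whatever JSONSerialization produced and that it can be handed straight back for re-encoding.

```swift
import Foundation

// Hypothetical marker type; not part of Foundation.
struct Unevaluated {
    let value: Any
}

// Toy stand-in for JSONDecoder's private unbox: intercept the marker type
// first, otherwise fall back to a plain cast.
func unbox<T>(_ value: Any, as type: T.Type) -> T? {
    if type == Unevaluated.self {
        return Unevaluated(value: value) as? T
    }
    return value as? T
}

let original = Data(#"{"id": 7, "extra": {"tags": ["a", "b"], "flag": null}}"#.utf8)
let parsed = try JSONSerialization.jsonObject(with: original)

// Ask for the untouched representation instead of a concrete type.
let raw = unbox(parsed, as: Unevaluated.self)!

// Re-encode the preserved value without ever inspecting its structure.
let reencoded = try JSONSerialization.data(withJSONObject: raw.value)

// Compare canonical (sorted-key) serializations of the input and the round trip.
let reparsed = try JSONSerialization.jsonObject(with: reencoded)
let before = try JSONSerialization.data(withJSONObject: parsed, options: [.sortedKeys])
let after = try JSONSerialization.data(withJSONObject: reparsed, options: [.sortedKeys])
let roundTripped = (before == after)
```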

Dynamic Member Lookup

cherrywoods · March 18, 2018, 5:44pm

I think dynamic lookup for Unevaluated would be a truly great thing. However, I have some concerns related to the implementation and the usability of the Codable environment for Decoder implementors.

My main question here is: How will this dynamic lookup on Unevaluated be implemented? Since .value would be specific to the actual decoder, it would be hard or even impossible to know anything about it and provide dynamic lookup on it. We could guess here, or require that the representation match the one used by JSONDecoder and PlistDecoder. I don't think that would be good.

Right now I see three ways to work around this:

1. Let the implementor of the Decoder take care of this.
2. Require that .value will (not literally) be a KeyedDecodingContainer, UnkeyedDecodingContainer or a SingleValueDecodingContainer. With something like that, we could do a lot of dynamic lookup (I guess).
3. Refer to the decoder that was asked to return Unevaluated (use the decoder as a delegate). NOTE: It is actually a bit more complicated than I first thought: Unevaluated needs some sort of immutable snapshot of the decoder, because the decoder's storage will change.

In case 1, the implementor would, as far as I can see, in essence have to write just another decoder. They would need to supply pretty much the same functionality twice, because they would need to write a Decoder and then something similar that works as a delegate for dynamic lookup on Unevaluated. Case 3 resolves this.

Case 2 is an approach to standardizing the storage of a Decoder. Unevaluated would require that .value conform to certain protocols like KeyedContainer, UnkeyedContainer and SingleValueContainer. I think this would be super cool, because it becomes easier to write a Decoder if you rely on Unevaluated and implement KeyedDecodingContainer, etc. on top of it, instead of implementing that logic yourself. Seen from a different perspective, though, it could also look like you have to implement another keyed container type instead of another decoder. Interestingly, this correlates to some extent with what I think is a way to make Decoder simpler to implement (I pointed this out a bit here, it's already implemented, please see the v2 branch of https://www.github.org/cherrywoods/swift-meta-serialization)

Case 3 would be the easiest for the implementor of a Decoder, I guess: you would not create an Unevaluated with the content on top of the storage; instead you would pass (EDIT: an immutable snapshot of) self to init. Unevaluated would implement the dynamic lookup by somehow using the "traditional" methods, like the container method. An Encoder would use the decoder and look at its storage when re-encoding. Dynamic lookup would look something like this:

unevaluated.myCodingKey? as String?

However, this leads to another (pretty interesting, I think) question: Would Unevaluated actually be an alternative to Decoder for the user?
Consider this "traditional" decoding code:

init(from decoder: Decoder) throws {
    let container = try decoder.container(keyedBy: SomeCodingKeys.self)
    self.value = try container.decode(Int.self, forKey: .value)
    var nestedContainer = try container.nestedUnkeyedContainer(forKey: .otherStuff) // var: decoding from an unkeyed container mutates it
    self.one = try nestedContainer.decode(String.self)
    self.two = try nestedContainer.decode(Bool.self)
}

Will this do the same?

init(from decoder: Decoder) throws {
    let unevaluated = decoder.unevaluatedData()
    self.value = unevaluated.value // The property access can't throw, the lookup would return nil, I omit the handling here
    self.one = unevaluated.otherStuff.0 // if .0 will be possible, it could also be:
    self.two = unevaluated.otherStuff[1]
}
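
To make the property-style access above a little more tangible, here is a rough sketch using @dynamicMemberLookup. Everything in it is an assumption for illustration: it presumes the raw storage consists of plain Swift dictionaries and arrays (a real decoder would not guarantee that), and the stored property is called underlying rather than value, because a declared value property would take precedence over a dynamic .value lookup.

```swift
// Hypothetical wrapper; assumes storage made of [String: Any] and [Any].
@dynamicMemberLookup
struct Unevaluated {
    // Not named `value`: a declared member would shadow the dynamic `.value` lookup.
    let underlying: Any

    // unevaluated.someKey: keyed lookup, nil if this is not a dictionary
    // or the key is missing.
    subscript(dynamicMember key: String) -> Unevaluated? {
        guard let dictionary = underlying as? [String: Any],
              let member = dictionary[key] else { return nil }
        return Unevaluated(underlying: member)
    }

    // unevaluated[1]: unkeyed lookup, nil if this is not an array
    // or the index is out of range.
    subscript(index: Int) -> Unevaluated? {
        guard let array = underlying as? [Any],
              array.indices.contains(index) else { return nil }
        return Unevaluated(underlying: array[index])
    }
}

let storage: [String: Any] = ["value": 42, "otherStuff": ["one", true] as [Any]]
let unevaluated = Unevaluated(underlying: storage)

// Every step is optional and every leaf needs a cast: this is the safety
// that is traded away compared to container.decode(_:forKey:).
let value = unevaluated.value?.underlying as? Int
let one = unevaluated.otherStuff?[0]?.underlying as? String
let two = unevaluated.otherStuff?[1]?.underlying as? Bool
let missing = unevaluated.doesNotExist
```

Note that a misspelled key compiles fine and only shows up as nil at run time.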

In my opinion, adding dynamic lookup would make a decoder (and also an encoder) more usable. But I do think that this step should be part of a larger redesign of decoder and encoder; it should not be added in parallel, unless I am mistaken about its capabilities. It could also be added to decoder directly.

For Unevaluated in general, I see just one minor issue.
From now on I will treat Unevaluated purely as a way to get "raw" storage data that will just be re-encoded, and assume that Unevaluated is a struct like this one:

Decode a JSON object of unknown format into a Dictionary with Decodable in Swift 4

public struct Unevaluated : Codable {
    public let value: Any
    public init(_ value: Any) { self.value = value }

    public init(from decoder: Decoder) throws {
        // throw a type mismatch
    }

    public func encode(to encoder: Encoder) throws {
        // throws an invalid value error
    }
}

The abstract concept somehow suggests to me that I can also use this with encoders other than the one related to the decoder I got the Unevaluated from. I should of course not do this, but I could mix JSONDecoder and PlistEncoder and succeed in doing so, if the JSON only contained Dictionaries, Arrays, Strings and Numbers and the implementation was this one:

Decode a JSON object of unknown format into a Dictionary with Decodable in Swift 4

fileprivate func unbox<T : Decodable>(_ value: Any, as type: T.Type) throws -> T? {
    if type == Unevaluated.self {
        return Unevaluated(value) as? T

I don't think that this is a totally unrealistic scenario. If we have some structure that re-encodes something when it is encoded, and does this on a totally generic level (not specific to any serialization format), why not give it to another encoder? Also, the unknown structure issue can come up if we want to transfer to another format and not back to the original one.

For those reasons, I would prefer Unevaluated to be a protocol rather than a struct.

With a protocol the implementor of decoder could also connect this with format specific lookup and manipulation support, e.g. with a JSON enum that conforms to Unevaluated (although format specific lookup seems not to be necessary to me, if there is dynamic lookup). If passing this to e.g. a PlistEncoder, one would get a clear error here, or JSON could even support such a cross over by implementing Encodable and encoding the way it is implemented in itaiferber's gist. Unevaluated could be documented as a good point to implement format specific lookups and manipulation.

One disadvantage of a protocol is that it won't be possible to call decode(Unevaluated.self) to get the raw storage of a decoder in order to encode it later, as far as I can see. A method returning Unevaluated would work, but this would require all decoders to support it. However, all decoders should have some sort of data they are working on that they can pass back here.

Another disadvantage is that Case 1 from above now applies. Format-specific lookup code would still look very similar for similar formats (e.g. JSON and msgpack), I think.
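
A rough sketch of the protocol variant discussed above, to show where the "clear error" would come from. All names here (the empty Unevaluated protocol, JSON, PropertyList, UnsupportedUnevaluated, acceptForJSON) are made up for illustration; a real design would also have to decide how a decoder hands out such a value.

```swift
// Hypothetical protocol version: each format supplies its own conforming
// type instead of sharing one struct that wraps Any.
protocol Unevaluated {}

// Format-specific raw value for JSON.
enum JSON: Unevaluated, Equatable {
    case null
    case bool(Bool)
    case number(Double)
    case string(String)
    case array([JSON])
    case object([String: JSON])
}

// Stand-in for another format's raw representation.
enum PropertyList: Unevaluated {
    case string(String)
    case data([UInt8])
}

// Error a format's encoder could throw when handed a foreign raw value.
struct UnsupportedUnevaluated: Error {
    let found: Any.Type
}

// Toy encoder-side check: a JSON encoder only accepts its own raw type
// and reports everything else explicitly.
func acceptForJSON(_ value: Unevaluated) throws -> JSON {
    guard let json = value as? JSON else {
        throw UnsupportedUnevaluated(found: type(of: value))
    }
    return json
}

let kept: Unevaluated = JSON.object(["tags": .array([.string("a")])])
let accepted = try? acceptForJSON(kept)
let rejected = try? acceptForJSON(PropertyList.string("x"))
```

Here accepted holds the JSON value and rejected is nil, because the property-list value is refused instead of being silently re-encoded.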

itaiferber · March 20, 2018, 3:16pm

Thanks for putting together some thoughts on the issue, and apologies on the delay — I had most of this typed out yesterday but didn't get a chance to finish. There's a lot to unpack here, so let's take a step back for a moment.

I think I should have gated this statement a bit better, and with some thought, I've changed my mind a little. I think adding dynamic lookup to Unevaluated would be a great addition if we

1. Identify a clear use-case for adding the feature
2. Decide on beneficial semantics of the implementation that would integrate well with the existing API

Before tackling #2 here, I think we need to figure out #1 — is there indeed a use-case here that merits the potential complexity of the feature?

To reiterate what I said in the other thread,

Decode a JSON object of unknown format into a Dictionary with Decodable in Swift 4

The original request for Unevaluated, both partially here, and in SR-5311 is an answer to the currently intractable case of "there is a part of my serialized object graph that I know nothing about, and need to be able to decode without knowing anything about it, for the sake of preserving the values as-is." Usually this comes up when you are decoding parts of a payload whose structure is variable — you need some of the information from it, but not all, and the parts you don't need (and know nothing about) need to be preserved for re-encoding back later on.

The idea here behind Unevaluated is to give you that data, as-is, in a format that may or may not be opaque to you. You can later re-encode the Unevaluated instance, and ideally, get back the same representation that you had before.

This is separate from a request for returning a parsed response in a reasonably consumable format.

The primary goal for introducing Unevaluated is to solve the issue of not being able to decode data you know nothing about for the purpose of preserving it. At its core, the representation of this data will be opaque to you, since the purpose is not to consume the data but to preserve it for future re-encodes.

So before we decide on dynamic lookup or anything similar, we need to decide on whether making Unevaluated consumable in a reasonable way is something we want to do or not; all API needs motivation for its introduction, not motivation against its introduction. Questions:

1. What use-case does making Unevaluated consumable solve? Is there something you can do by getting an Unevaluated that you can't by using existing APIs directly?
2. What sort of patterns might we enable by making Unevaluated consumable?

I think the answer to #1 is "no" at the moment. Right now, I can't think of anything you can't decode by writing your own AnyCodable/JSON/what-have-you enum to decode arbitrary contents of a payload in a way that lets you inspect them in a type-safe way.

As for #2, I think your example shows how we might expect some folks to use the feature:

cherrywoods:

init(from decoder: Decoder) throws {
    let unevaluated = decoder.unevaluatedData()
    self.value = unevaluated.value
    self.one = unevaluated.otherStuff.0
    self.two = unevaluated.otherStuff[1]
}

Error-handling and casting aside, I don't think it would be unreasonable for many developers to flock to a potentially "easier" and less verbose way of getting at their values. Given two ways of doing the same thing with one of them being easier at the cost of some safety, I think many would understandably go with the easier option. In general, we try to avoid offering two ways of doing the same thing, especially when a major goal of the Codable design is to offer strong type safety for working with data that's generally not typed; undermining that goal won't be much help.

So, if we're looking to add dynamic lookup to something like Unevaluated, let's motivate it — is there a problem that we would be solving (in a format-agnostic way)?

Keep in mind, again, that this is in contrast to offering your own, say AnyCodable type (possible today) which would allow you to do something like this:

init(from decoder: Decoder) throws {
    let container = try decoder.singleValueContainer()
    let stuff = try container.decode(AnyCodable.self)
    guard case .dictionary(let dictionary) = stuff,
          case .array(let array) = dictionary["otherStuff"] else {
        // throw
    }

    self.one = array[0]
    self.two = array[1]
}

If you do offer your own type, you can also definitely add dynamic lookup to make it more dynamic too.
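
For reference, a self-written type of this kind might look like the sketch below. JSONValue is a hypothetical name, and the case layout and decoding order are one possible choice rather than the AnyCodable referred to above. It decodes arbitrary JSON with today's APIs, can be inspected by pattern matching, and re-encodes to equivalent JSON.

```swift
import Foundation

// Hypothetical stand-in for an AnyCodable/JSON enum; possible today.
enum JSONValue: Codable, Equatable {
    case null
    case bool(Bool)
    case number(Double)
    case string(String)
    case array([JSONValue])
    case object([String: JSONValue])

    init(from decoder: Decoder) throws {
        let container = try decoder.singleValueContainer()
        // Try each representation in turn; Bool is tried before Double.
        if container.decodeNil() {
            self = .null
        } else if let bool = try? container.decode(Bool.self) {
            self = .bool(bool)
        } else if let number = try? container.decode(Double.self) {
            self = .number(number)
        } else if let string = try? container.decode(String.self) {
            self = .string(string)
        } else if let array = try? container.decode([JSONValue].self) {
            self = .array(array)
        } else if let object = try? container.decode([String: JSONValue].self) {
            self = .object(object)
        } else {
            throw DecodingError.dataCorruptedError(
                in: container, debugDescription: "Unsupported JSON value.")
        }
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.singleValueContainer()
        switch self {
        case .null: try container.encodeNil()
        case .bool(let bool): try container.encode(bool)
        case .number(let number): try container.encode(number)
        case .string(let string): try container.encode(string)
        case .array(let array): try container.encode(array)
        case .object(let object): try container.encode(object)
        }
    }
}

let data = Data(#"{"value": 1, "otherStuff": ["one", true]}"#.utf8)
let stuff = try JSONDecoder().decode(JSONValue.self, from: data)

// Type-safe inspection by pattern matching, as in the example above.
var one: String?
var two: Bool?
if case .object(let dictionary) = stuff,
   case .array(let array)? = dictionary["otherStuff"],
   array.count == 2,
   case .string(let first) = array[0],
   case .bool(let second) = array[1] {
    one = first
    two = second
}

// Re-encoding yields JSON that decodes to an equal value.
let reencoded = try JSONEncoder().encode(stuff)
let roundTripped = try JSONDecoder().decode(JSONValue.self, from: reencoded)
```

Unlike Unevaluated, this converts the payload into its own representation, so details such as number formatting or key order are not guaranteed to survive byte for byte.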

cherrywoods · March 20, 2018, 5:27pm

No need to apologise, I'm not in a hurry.
Thank you for giving my thoughts API-relevant context!

I actually can't see a specific use case. This is also part of what I wanted to point out: I think we can do all this with a decoder or a container.

Right now, I would like to ask another question about Unevaluated in general:

What I have in mind here is motivated by this thread: Using Unsafe pointers to manipulate a JSON Encoder

The question is: Using Unevaluated, would there be a way to get only the parts of the underlying data as Unevaluated, that I haven't evaluated? Less abstract:

init(from decoder: Decoder) throws {
    let container = try decoder.container(keyedBy: /*some CodingKey enum*/)
    let name = try container.decode(String.self, forKey: /*some key*/)
    let remaining = try container.remainingUnevaluatedData() // would this be possible?
}

If not, this would not be an unsolvable issue; it could be solved like this:

1. ask the keyed container for all coding keys, via .allKeys
2. decode Unevaluated for all the keys
3. store the results in a dictionary and use it for re-encoding
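
The three steps above can be sketched with today's APIs. Since Unevaluated does not exist, the sketch substitutes a small hand-written JSON value (Preserved) for it; Preserved, AnyKey and Person are all hypothetical names chosen for this example.

```swift
import Foundation

// Hypothetical stand-in for Unevaluated: a minimal JSON value that keeps
// unknown members around for re-encoding.
enum Preserved: Codable, Equatable {
    case null, bool(Bool), number(Double), string(String)
    case array([Preserved]), object([String: Preserved])

    init(from decoder: Decoder) throws {
        let c = try decoder.singleValueContainer()
        if c.decodeNil() { self = .null }
        else if let v = try? c.decode(Bool.self) { self = .bool(v) }
        else if let v = try? c.decode(Double.self) { self = .number(v) }
        else if let v = try? c.decode(String.self) { self = .string(v) }
        else if let v = try? c.decode([Preserved].self) { self = .array(v) }
        else { self = .object(try c.decode([String: Preserved].self)) }
    }

    func encode(to encoder: Encoder) throws {
        var c = encoder.singleValueContainer()
        switch self {
        case .null: try c.encodeNil()
        case .bool(let v): try c.encode(v)
        case .number(let v): try c.encode(v)
        case .string(let v): try c.encode(v)
        case .array(let v): try c.encode(v)
        case .object(let v): try c.encode(v)
        }
    }
}

// Coding key that accepts any string, so unknown keys can be enumerated.
struct AnyKey: CodingKey {
    let stringValue: String
    var intValue: Int? { nil }
    init?(stringValue: String) { self.stringValue = stringValue }
    init?(intValue: Int) { return nil }
}

struct Person: Codable {
    var name: String
    var remaining: [String: Preserved]

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: AnyKey.self)
        name = try container.decode(String.self, forKey: AnyKey(stringValue: "name")!)

        // Steps 1-3: walk allKeys, decode what was not evaluated, keep it.
        remaining = [:]
        for key in container.allKeys where key.stringValue != "name" {
            remaining[key.stringValue] = try container.decode(Preserved.self, forKey: key)
        }
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: AnyKey.self)
        try container.encode(name, forKey: AnyKey(stringValue: "name")!)
        // Write the preserved members back next to the known ones.
        for (key, value) in remaining {
            try container.encode(value, forKey: AnyKey(stringValue: key)!)
        }
    }
}

let input = Data(#"{"name": "Ada", "age": 36, "tags": ["x", "y"]}"#.utf8)
let person = try JSONDecoder().decode(Person.self, from: input)
let output = try JSONEncoder().encode(person)
let again = try JSONDecoder().decode(Person.self, from: output)
```

The bookkeeping of which keys were already evaluated is left to the type itself, which is the part a remainingUnevaluatedData()-style API would take over.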

I see that this is not a big thing, since it is possible to do. However, to me it seems like this is part of the use case for which Unevaluated is actually meant (various examples like the one in the thread above).

To ask an even more general question: All cases I have seen until now for Unevaluated were specific to a concrete format (JSON, actually). Is there a use case so generic that it cannot be solved by a JSONValue (or MessagePackValue, or whatever)?
The one I can see is transferring to another format, which could be very easy with a thing like Unevaluated, but the current Unevaluated isn't usable for it.

The question is whether this kind of need-to-re-encode-later issue only comes up in format-specific decoding code, or whether there are truly format-independent structures (like Dictionary, Array, Set, Float, LinkedLists, other data structures) that need this. It is of course nice to keep custom decoding code as generic as possible, but to me, it seems as if you could make the necessary assumptions to use something like JSONValue in all the cases I have seen for Unevaluated. The assumptions I mean are: What are the possible structurings (this is actually set by the decoder: keyed and unkeyed containers), and what are the "primitives" of the data? In the case of a JSON-specific API, these would be Strings, Bools, Numbers and Null. In msgpack you would additionally have binary data and extension values. Maybe the API sets some restrictions on the possible primitives too. In my current view, you would only really need Unevaluated if you have an API that is not bound to any format, or to a format you can put data into in ways not even the format knows. I don't know if something like this exists. Maybe I'm mistaken here.

If this just isn't clear to me and we are talking about adding Unevaluated because it makes it easier to handle such format-specific cases, I would be fine with it.