Rearchitecting JSONEncoder to be much faster

michaeleisel · August 22, 2019, 4:52pm

Currently, JSONEncoder can spend a lot of time doing the following things:

Creating intermediate Swift/ObjC data structures to pass to JSONSerialization
Creating a new Swift String for each key conversion
Inside of JSONSerialization, using libc functions to convert numbers to strings

All of these things can be resolved with a different architecture. There may be other good ones, and if so I'd love to hear about them, but here's what I'll pitch:

Whenever an encode method is called, write the JSON out directly, rather than storing anything in dictionaries or delegating any work to JSONSerialization. The one tricky bit is that it may be told to encode something that needs to come later after other, unseen, pieces of data are encoded. For example:

var container1 = encoder.unkeyedContainer()
var container2 = encoder.unkeyedContainer()
try container2.encode(...)
try container1.encode(...)

If container2's data is written directly to the final Data that will be output, then we have a problem when container1 tries to write to it. However, correct me if I'm wrong, this will only happen when the user has implemented their own encoding method and is getting a bit wonky with it. At that point, the encoder could keep track of multiple strings and paste them together. There are different ways to do this, and I can flesh out that part, but first I want to gauge interest in this overall proposal.
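One way the "multiple strings pasted together" fallback could look is sketched below. This is only an illustration: Fragment and FragmentedOutput are hypothetical names, not Foundation types, and a real encoder would also have to handle nesting and separators.

```swift
// Hypothetical types for illustration; none of these exist in Foundation.
final class Fragment {
    var bytes: [UInt8] = []

    func append(_ text: String) {
        bytes.append(contentsOf: text.utf8)
    }
}

final class FragmentedOutput {
    private var fragments: [Fragment] = []

    // Each container gets its own buffer. Its position in the final output
    // is fixed when the container is created, not when it is written to.
    func makeFragment() -> Fragment {
        let fragment = Fragment()
        fragments.append(fragment)
        return fragment
    }

    // Paste the buffers together in creation order, whatever the write order was.
    func finish() -> [UInt8] {
        var result: [UInt8] = []
        result.reserveCapacity(fragments.reduce(0) { $0 + $1.bytes.count })
        for fragment in fragments {
            result.append(contentsOf: fragment.bytes)
        }
        return result
    }
}
```

In the common case only one fragment would ever exist, so the fast path stays a single append-only buffer and the stitching cost is paid only by the wonky encoders.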

Another benefit is that, with Swift handling the whole process and not delegating to JSONSerialization, it will be able to optimize things like the conversion of numbers to strings, which is rarely as fast as possible when using libc functions. It may even be able to write multiple numbers at once in a fast way with vectorized instructions. But this is an optional benefit.
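As a rough sketch of what converting numbers without libc could mean for integers, the hypothetical helper below (appendDecimal is an illustrative name, not an existing API) writes ASCII digits straight into the output buffer. It makes no claim about being the fastest approach and does not attempt the vectorized variant.

```swift
// Hypothetical helper: appends the decimal ASCII digits of `value` to `buffer`
// without creating an intermediate String.
func appendDecimal(_ value: Int, to buffer: inout [UInt8]) {
    if value < 0 {
        buffer.append(UInt8(ascii: "-"))
    }
    // `magnitude` is a UInt, so Int.min does not overflow.
    var magnitude = value.magnitude
    let start = buffer.count
    // Emit digits least-significant first, then reverse them in place.
    repeat {
        buffer.append(UInt8(ascii: "0") + UInt8(magnitude % 10))
        magnitude /= 10
    } while magnitude != 0
    buffer[start...].reverse()
}
```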

I am confident that these changes would make JSONEncoder at least 3x the speed it currently is. I have applied some of these same ideas and philosophies in ZippyJSON, which decodes at 4-5x the speed of JSONDecoder.

Jon_Shier · August 22, 2019, 5:40pm

Yes please, JSONDecoder is embarrassingly slow, and JSONEncoder can be a bottleneck on the server side in some situations.

@michaeleisel It would be very interesting to see how much of ZippyJSON's performance comes from the various changes you've made:

Use of simdjson.
Removal of Foundation type boxing.
Lack of key conversion.

I'm not sure these things are easily separable, but the unfortunate fact is that Swift may not be amenable to importing simdjson as a dependency, so I'd have to wonder how much of a win there would be just by replacing the Foundation type boxing.

michaeleisel · August 22, 2019, 5:50pm

@Jon_Shier we may want to move that into a separate thread if we want to dig into JSONDecoder performance and its improvements. The reason I focus on JSONEncoder is that I've already released that lib for decoding. But to go over your points, and without trying to do a full analysis with numbers:

simdjson isn't super relevant to encoding
Removal of Foundation type boxing is big
Lack of slow key conversion is big. Not as big for encoding, but still will be a big win (particularly if it can be done at a lower level)

This brings us to an important point, if there is indeed support for this. How much should be written in a lower level language? For example, key conversion for snake <-> camel is way, way faster in C/C++. My preference is to do as much as possible in C/C++ (too bad Rust is hard to integrate into things) but there may be pushback on that.
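For a sense of what a byte-level camel-to-snake conversion might look like when kept in Swift, here is a simplified sketch. snakeCased is a hypothetical function, and it deliberately does not reproduce the full rule set of JSONEncoder's convertToSnakeCase strategy (for example, runs of capitals such as "URL" are split letter by letter here).

```swift
// Simplified sketch: every ASCII uppercase letter starts a new word.
// This is an assumption made for brevity, not Foundation's actual rule set.
func snakeCased(_ key: String) -> String {
    var out: [UInt8] = []
    out.reserveCapacity(key.utf8.count + 4)
    for byte in key.utf8 {
        if byte >= UInt8(ascii: "A") && byte <= UInt8(ascii: "Z") {
            // Insert a separator unless the key starts with a capital.
            if !out.isEmpty {
                out.append(UInt8(ascii: "_"))
            }
            out.append(byte + 32) // ASCII lowercase
        } else {
            // Lowercase, digits, and non-ASCII bytes pass through unchanged.
            out.append(byte)
        }
    }
    return String(decoding: out, as: UTF8.self)
}
```

Working on the utf8 view avoids per-Character grapheme breaking, which is where much of the cost of a naive String-based version would likely come from.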

David_Smith · August 22, 2019, 6:10pm

This is great stuff, thanks for doing it @michaeleisel!

I've been working on JSONDecoder/JSONEncoder performance, but with an orthogonal approach: I'm using it as a test case and optimization target for improving Swift <-> ObjC bridging. So far this has been (roughly) a 1.5x speedup, and I have some ideas for pushing that much further.

That said, I fully expect that this approach will run out of steam and we'll want to do something much like you describe, and I think it's not unreasonable that that could happen in parallel, since faster bridging and faster JSON are independently valuable.

Regarding the question of what language to write things in, as a stdlib maintainer I am of course in favor of trying to write it in Swift and filing a flurry of compiler and stdlib performance bug reports on anything that blocks that from being fast (e.g. converting keys to Swift Strings should be extremely fast for most keys because of SmallStrings. If you're seeing it being a problem I bet you're running into something fixable in the bridging system). I realize there are pragmatic reasons not to do this though :)

One potentially thorny issue is maintaining as much compatibility as possible with NSJSONSerialization; the JSON spec is ambiguous on some issues, and we've historically found that many JSON parsers both aren't spec compliant in edge cases, and interpret the ambiguities differently. Not to say that simdjson specifically does either of these, just that it's something to keep in mind.

[edit]

Also it's worth keeping in mind that the situation is currently completely different on Darwin and Linux, since Darwin uses the ObjC implementation of NSJSONSerialization and Linux can't.

michaeleisel · August 22, 2019, 6:40pm

Also it's worth keeping in mind that the situation is currently completely different on Darwin and Linux, since Darwin uses the ObjC implementation of NSJSONSerialization and Linux can't.

Are you saying Linux uses a separate implementation? Are there any performance implications from that? In the source code of JSONEncoder.swift, I don't see any platform specific branching around try JSONSerialization.jsonObject(with: data).

As for copying NSJSONSerialization's behavior, it seems like it would at least be easier on the encoding side than on the decoding, as there it's receiving Swift objects rather than arbitrary data.

I can't say I know too much about the ins and outs of bridging performance, SmallString, etc., so you may be able to estimate how far these things will go better than me. Typically, whether with Swift or ObjC, my strategy is always to get things out of ObjC/Swift land and into C-level data types, do all the hard work there, and then at the last moment convert back. It seems you have the opposite approach, which certainly has advantages.

Jon_Shier · August 22, 2019, 6:57pm

JSONSerialization itself has a separate Linux implementation, as part of swift-corelibs-foundation.

I would encourage you to benchmark your implementations in Swift first, reporting any bottlenecks as bugs. It's the only way we can improve the performance of the language, and it will make porting between platforms far easier. Using Swift for everything also encourages open source contributions, as contributors don't have to learn multiple languages.

michaeleisel · August 22, 2019, 7:22pm

Without getting us too deep into the Swift vs. C/C++ question, I'd like to first understand if it would likely be accepted or rejected, and if accepted, if there's anyone who's interested in working on it.

David_Smith · August 22, 2019, 8:09pm

To be honest I'm not sure how the corelibs version of JSONSerialization fares vs the ObjC version. It would be interesting to measure! It is not particularly heavily optimized (I have a speculative PR up that may help), but avoiding bridging more might be enough to put it in the lead anyway.

Really this is mostly about different goals. You're trying to write a fast JSON coder, I'm trying to write a fast Swift to help other people write fast software (including JSON coders). Both are great things to do.

mayoff · August 22, 2019, 9:15pm

Is this even an intentionally-supported use? Is it expected or required in general that an Encoder or Decoder support multiple open sibling containers? cc @itaiferber

millenomi · August 22, 2019, 10:32pm

Per today's discussion: note that JSONSerialization does employ bridging currently (and to some extent has to bridge, because we want to return NSNumber through the Any return type, so that a downstream cast can switch it to any appropriate numeric type.)

Optional · August 23, 2019, 1:01am

JSONEncoder is a test piece.

TellowKrinkle · August 23, 2019, 1:58am

I really wish Encodable had used a closure-based approach (encoder.withUnkeyedContainer { c in /* encode to c */ }) to prevent this, it would have made a lot of things much simpler. Especially since with the current system it's practically impossible to make an Encoder with just structs from what I can tell.
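To make the closure-based idea concrete, here is a minimal sketch of a struct-only writer. DirectWriter and its methods are hypothetical and are not part of Encodable or Foundation; the sketch only handles integers and arrays.

```swift
// Hypothetical struct-only writer: a container exists only for the duration
// of its closure, so two sibling containers can never be open at once.
struct DirectWriter {
    private(set) var bytes: [UInt8] = []
    private var needsComma = false

    // Writes a "," before every element except the first in the current scope.
    private mutating func writeSeparator() {
        if needsComma {
            bytes.append(UInt8(ascii: ","))
        }
        needsComma = true
    }

    mutating func encode(_ value: Int) {
        writeSeparator()
        bytes.append(contentsOf: String(value).utf8)
    }

    mutating func withUnkeyedContainer(_ body: (inout DirectWriter) -> Void) {
        writeSeparator()
        bytes.append(UInt8(ascii: "["))
        needsComma = false   // the first element of the new array needs no comma
        body(&self)
        bytes.append(UInt8(ascii: "]"))
        needsComma = true    // the array itself counts as an element of its parent
    }
}
```

Because the closing bracket is written as soon as the closure returns, everything can be appended to one buffer in order, with no stitching step.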

TellowKrinkle · August 23, 2019, 2:49am

Just tested, it looks like JSONEncoder fails a precondition if you try to make multiple containers, which makes sense. You can, however, do this, which is just as bad:

struct Test {
	var a: Int
	var b: Int
}

extension Test: Encodable {
	func encode(to encoder: Encoder) throws {
		var tmp = encoder.unkeyedContainer()
		var aEnc = tmp.superEncoder().singleValueContainer()
		var bEnc = tmp.superEncoder().singleValueContainer()
		try aEnc.encode(a)
		try bEnc.encode(b)
	}
}

lukasa · August 23, 2019, 8:37am

I would very strongly caution against doing this, especially with parsing. An enormous amount of trouble in the world is a direct result of passing untrusted data into unsafe programming languages.

While there are some pragmatic reasons for using C/C++ in some places, I don't think this is one of them. Code that definitely handles untrusted data should be written in a way to eliminate entire classes of bugs, and writing it in Swift does exactly that. I would propose that any implementation we involve should strive as hard as possible to not only use Swift extensively, but to minimise the number of times it uses the word unsafe, with a meta-goal that any point in the code that requires an unsafe be accompanied by a bugs.swift.org report explaining why it was needed and what it would take to remove it.

t089 · August 23, 2019, 11:56am

100% agree. I think a fast pure Swift JSON parser / encoder would help tremendously.

Has anybody ever tried to put the JSON classes from swift-protobuf behind the Codable interfaces and measure the performance?

michaeleisel · August 23, 2019, 2:23pm

I would propose that any implementation we involve should strive as hard as possible to not only use Swift extensively, but to minimise the number of times it uses the word unsafe

I'm not going to argue too hard in favor of the other side. If everyone wants this, I'm fine with it. My first goal is to just get buy-in for the architecture, especially from people who would accept/reject the PR in the Swift project.

Tony_Parker · August 26, 2019, 8:57pm

The reason that JSONEncoder and JSONDecoder call through to JSONSerialization is to keep that code as cross-platform (Darwin/Linux) as possible. On Darwin, the ObjC implementation of JSONSerialization is actually pretty fast (not counting the Swift bridging overhead); on Linux there has not been as much effort put into optimization.

We have suspected for some time that JSONEncoder/JSONDecoder could have a specialized JSON implementation that is written with those particular use cases in mind and could potentially outperform the general-purpose JSONSerialization on all platforms -- especially due to the mentioned bridging overhead. It just hasn't yet been a priority to investigate.

swift-corelibs-foundation does of course mix quite a bit of C and Swift, but I agree with others in this thread that we should be able to write fast code in just Swift. If not, then that can inform general improvements to the language or compiler. The C in swift-corelibs-foundation is about portability and not performance.

johannesweiss · August 27, 2019, 9:14am

Adding @Joannis_Orlandos who's also interested in fast JSON with Codable.

tbkka · August 27, 2019, 6:28pm

I wrote those classes. I've not tried adapting them to work with Codable, but I think it's worth trying:

For encoding, it should be pretty straightforward and the performance should be roughly similar to SwiftProtobuf, since SwiftProtobuf's internal encoding APIs are pretty similar to Codable.
For decoding, I suspect the story is more complicated. The decoding support in Codable has some intrinsic overhead that's hard to avoid.

For those interested in JSON performance generally, I've explored some of this terrain in the process of building SwiftProtobuf's JSON support. In particular, SR-106 greatly improved the performance of getting a round-trip-accurate representation of a floating-point number (I believe it's generally faster than C/C++ now). For parsing, the C library strtod is actually pretty performant.
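A small check of what "round-trip-accurate" means in practice, assuming a toolchain that includes the SR-106 change (roundTrips is a hypothetical helper name):

```swift
// Returns true when the textual form of `value` parses back to the same Double.
func roundTrips(_ value: Double) -> Bool {
    let text = value.description   // e.g. "0.30000000000000004" for 0.1 + 0.2
    return Double(text) == value
}
```

A JSON encoder that emits value.description for finite doubles would therefore be expected to lose no precision on the way out.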

scanon · August 27, 2019, 6:37pm

There's still copious room for improvement, it's just less dramatic than dtoa was. =)