How to write a fast JSON parser in Swift


#1

Hello community.
I am making a JSON encoder/decoder library that has useful features which Foundation's does not have.
It is called FineJSON.
URL: https://github.com/omochi/FineJSON

To implement some of these features, I need a JSON parser other than Foundation.JSONSerialization.
So I am writing my own JSON parser.
It is called RichJSONParser.
URL: https://github.com/omochi/RichJSONParser

My first implementation was amazingly slow compared to Foundation.
It was over 100x slower.
I tried many optimizations while looking at the Time Profiler in Instruments.
I referred to the implementation of JSON.parse and to the lexer of the Swift compiler.
In the end, my code became about 40x faster than my first implementation.
But it is still 2.5x slower than Foundation.

This last wall is very high for me,
and I have no more ideas for optimizing it.
So I am posting this thread here.

Please share any useful information, ideas, or topics that could help me make it faster.

This is the current implementation.
URL: https://github.com/omochi/RichJSONParser/blob/ee380d49182db6cbf502b87ed4435d9deb0f9ac5/Sources/RichJSONParser/FastJSONParser.swift

This is the performance measurement code.
URL: https://github.com/omochi/RichJSONParser/blob/ee380d49182db6cbf502b87ed4435d9deb0f9ac5/Tests/RichJSONParserTests/TestCase/BenchmarkTests.swift#L21-L50
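
For readers who do not want to open the link, here is a minimal sketch of how this kind of comparison can be timed. It is not the benchmark code at the URL above: `measureSeconds` is a hypothetical helper, and the sample document and iteration count are arbitrary assumptions.

```swift
import Foundation

// Hypothetical helper (not from RichJSONParser): runs `body` `iterations` times
// and returns the elapsed wall-clock time in seconds.
func measureSeconds(iterations: Int, _ body: () throws -> Void) rethrows -> Double {
    let start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<iterations {
        try body()
    }
    let end = DispatchTime.now().uptimeNanoseconds
    return Double(end - start) / 1_000_000_000
}

// Small sample document; a real benchmark would use a much larger JSON file.
let sample = Data("{\"name\": \"value\", \"list\": [1, 2, 3]}".utf8)

let elapsed = try measureSeconds(iterations: 1_000) {
    _ = try JSONSerialization.jsonObject(with: sample, options: [])
}
print("Foundation:", elapsed)
```

The same closure shape could wrap any other parser so both are measured under identical conditions, in an optimized build.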

Current scores (smaller is faster):

mine, Xcode 10.1
7.193193

mine, Xcode 10.2 beta
8.888982

Foundation, Xcode 10.1
2.880308

Based on the profiler results, I currently have the following concerns.

String initialization cost.
My code builds a UTF-8 byte buffer from a JSON string value if it contains a backslash or multibyte characters. In theory, my implementation always builds a valid UTF-8 sequence. But Swift.String validates it again in its initializer. Can I avoid this cost?
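
To make the pattern concrete, here is a minimal sketch of it, not the actual RichJSONParser code. `decodeStringBody` is a hypothetical helper that handles only a few escapes. The final `String(decoding:as:)` call is where the standard library validates the bytes again (and repairs invalid sequences with U+FFFD); as far as I know there is no public initializer that skips this step.

```swift
// Hypothetical helper: decodes the body of a JSON string literal (without the
// surrounding quotes). Only \n, \t, \" and \\ are handled, for illustration.
func decodeStringBody(_ source: [UInt8]) -> String {
    var buffer: [UInt8] = []
    // The escapes handled here never make the output longer than the input.
    buffer.reserveCapacity(source.count)

    var index = 0
    while index < source.count {
        let byte = source[index]
        if byte == UInt8(ascii: "\\"), index + 1 < source.count {
            switch source[index + 1] {
            case UInt8(ascii: "n"): buffer.append(UInt8(ascii: "\n"))
            case UInt8(ascii: "t"): buffer.append(UInt8(ascii: "\t"))
            default: buffer.append(source[index + 1]) // \" and \\ (others pass through)
            }
            index += 2
        } else {
            buffer.append(byte)
            index += 1
        }
    }

    // The standard library validates the UTF-8 here, even though the parser
    // already knows the buffer is well-formed.
    return String(decoding: buffer, as: UTF8.self)
}

let decoded = decodeStringBody(Array("a\\nb\\\"c".utf8))
print(decoded)
```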

String transcoding cost.
Swift 4.2 stores strings as UTF-16 internally. So the UTF-8 buffer I build has to be transcoded to UTF-16, which takes extra work.

Swift 5 is slower than Swift 4.2.
I ran the benchmark in Swift 5 with Xcode 10.2 beta.
I expected it to be faster than Swift 4.2,
because Swift 5 stores strings as UTF-8 internally, so the transcoding step can be skipped.
Strangely, it is slower.
I am very confused by this result.

Array expanding cost.
According to the Time Profiler, array growth consumes a noticeable amount of CPU time (_copyToNewBuffer).
This happens when the number of JSON array elements or object key-value pairs exceeds the array's internal buffer capacity.
I could specify a capacity (reserveCapacity) based on a heuristic prediction about the JSON.
But the number of elements varies widely, from 0 to over 100.
It is a difficult tradeoff between time and memory.
I think the gain would be very small.
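
The effect of reserveCapacity can be observed with a small experiment. This is only a sketch: `countReallocations` is a hypothetical helper that watches `capacity` change while appending, and the element count of 100 is an arbitrary assumption, not a measurement from my parser.

```swift
// Hypothetical helper: appends `count` integers and counts how many times the
// array's capacity changes, i.e. how many times the buffer is reallocated.
func countReallocations(appending count: Int, reserving reserve: Int) -> Int {
    var elements: [Int] = []
    elements.reserveCapacity(reserve)

    var reallocations = 0
    var lastCapacity = elements.capacity
    for value in 0..<count {
        elements.append(value)
        if elements.capacity != lastCapacity {
            reallocations += 1
            lastCapacity = elements.capacity
        }
    }
    return reallocations
}

let withoutReserve = countReallocations(appending: 100, reserving: 0)
let withReserve = countReallocations(appending: 100, reserving: 100)
print("reallocations without reserve:", withoutReserve)
print("reallocations with reserve:", withReserve)
```

Reserving enough space up front removes the reallocations, but for a JSON array that turns out to hold only a few elements the reserved memory is wasted, which is the tradeoff described above.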

Source location tracking cost.
My implementation tracks the line number and column number in the JSON source.
This helps produce human-friendly error messages when a parse error occurs.
It requires updating two more Int variables than tracking only the offset.
But I feel this penalty is very small.
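
As an illustration of the bookkeeping involved, here is a minimal sketch; `SourceLocation` and `location(atOffset:in:)` are hypothetical names, not the types in RichJSONParser, and columns are counted in bytes for simplicity. The second function shows one possible alternative under that assumption: track only the offset while parsing and recompute line and column when an error is actually reported.

```swift
// Hypothetical location type; column counts UTF-8 bytes, not characters.
struct SourceLocation: Equatable {
    var offset = 0
    var line = 1
    var column = 1

    // Eager tracking: called once per consumed byte.
    mutating func advance(over byte: UInt8) {
        offset += 1
        if byte == UInt8(ascii: "\n") {
            line += 1
            column = 1
        } else {
            column += 1
        }
    }
}

// Lazy alternative: rescan the prefix only when a location is needed,
// for example when building an error message.
func location(atOffset offset: Int, in bytes: [UInt8]) -> SourceLocation {
    var result = SourceLocation()
    for byte in bytes.prefix(offset) {
        result.advance(over: byte)
    }
    return result
}

let text = Array("{\n  \"a\": 1".utf8)

var eager = SourceLocation()
for byte in text {
    eager.advance(over: byte)
}
let lazy = location(atOffset: text.count, in: text)
print(eager.line, eager.column, eager.offset)
```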

Even if all of these costs were eliminated, my intuition is that the CPU time saved would be small compared to the gap between my parser and Foundation.
What accounts for the remaining difference?

I am concerned about the heap allocation cost of Swift class instances (_swift_allocObject_).
My guess is that Foundation.JSONSerialization has some allocator specialized for JSON. I don't have enough knowledge about this. If that is the case, there is no way for me to close this gap. Even if I fought with raw allocation or Unmanaged<T>, I would need an additional conversion step to ordinary Swift objects for the library's public interface.
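
For reference, this is roughly what the Unmanaged<T> route and its conversion step look like. It is a minimal sketch with a hypothetical `Node` class, not code from my parser. Note that the object is still allocated through the normal path here; as far as I understand, Unmanaged only moves retain/release under manual control.

```swift
// Hypothetical class standing in for a parser-internal node.
final class Node {
    var value: Int
    init(_ value: Int) { self.value = value }
}

// Create the object and take manual ownership of one retain (+1).
let unmanaged = Unmanaged.passRetained(Node(42))

// Inside the parser, the node could be passed around as a raw pointer
// without ARC retain/release traffic.
let raw: UnsafeMutableRawPointer = unmanaged.toOpaque()

// The additional conversion step: hand the object back to ARC so that it can
// be returned through the library interface as an ordinary Swift object.
let node: Node = Unmanaged<Node>.fromOpaque(raw).takeRetainedValue()
print(node.value)
```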