Diffing symbolgraphs (or: how can i avoid checking in a ~30MB diff every time i regenerate one?)

every time i re-generate a symbol graph, the ordering of the JSON keys changes, and because symbolgraphs are minified, this means the entire file (often many megabytes in size) gets overwritten in git. this also makes it hard to detect actual changes in the symbolgraph output, since every regenerated symbolgraph shows up as a modification.

is there a better workflow here?

Is it possible to modify the JSON generation? My favourite option is sortedKeys, available with both JSONEncoder (as an output formatting option) and JSONSerialization (as a writing option).

…sort the keys?

that would involve adding another stage of postprocessing whose only purpose is to stabilize source control. the compiler really should just emit the JSON with keys sorted by default (or is there already a setting for this?)
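For reference, such a postprocessing stage could be quite small. The following is a minimal sketch in Python, not something the compiler or any existing plugin provides: it re-serializes a JSON document with sorted keys while keeping it minified. The function name `canonicalize` and the two toy inputs are made up for illustration.

```python
import json


def canonicalize(text: str) -> str:
    """Re-serialize a JSON document with sorted keys, keeping it minified."""
    data = json.loads(text)
    # sort_keys orders the keys of every nested object;
    # the compact separators keep the output on one line without extra spaces.
    return json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# Toy inputs (hypothetical): same content, different key order.
first = '{"module":{"name":"NIOSSL"},"metadata":{"generator":"x"}}'
second = '{"metadata":{"generator":"x"},"module":{"name":"NIOSSL"}}'

print(canonicalize(first) == canonicalize(second))
```

This only stabilizes object key order. The order of elements inside JSON arrays is left untouched.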

i will add that i tried implementing this as part of the swift-documentation-extract SPM plugin using swift-json, but that’s tricky because swift-json itself depends on the plugin, which makes a circular dependency…

I think the stability of the JSON is being discussed in GitHub issue #334.

@ethankusters mentioned a workaround here that may help in the meantime, although it sounds like this won't completely resolve the issue just yet.


Another way to handle this is to add a git filter (see the Git Attributes chapter of the Git documentation) and use a tool like jq:

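    # run the sort-keys filter on every JSON file
    echo "*.json filter=sort-keys" >> .gitattributes
    # on staging: sort keys, keep the output compact ("." is the identity program)
    git config filter.sort-keys.clean "jq --sort-keys -c ."
    # on checkout: pass the file through unchanged
    git config filter.sort-keys.smudge "cat"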

that’s an interesting approach, thanks!

investigating further, this approach also does not work, because the ordering of edges in the symbolgraph edge lists is also non-deterministic.

A:

    "relationships": [
        {
            "kind": "memberOf",
            "source": "s:7NIOCore7ChannelP6NIOSSLE17nioSSL_tlsVersionAA15EventLoopFutureCyAD10TLSVersionOSgGyF",
            "target": "s:7NIOCore7ChannelP",
            "targetFallback": "NIOCore.Channel"
        },
        {
            "kind": "memberOf",
            "source": "s:7NIOCore15ChannelPipelineC21SynchronousOperationsV6NIOSSLE17nioSSL_tlsVersionAF10TLSVersionOSgyKF",
            "target": "s:7NIOCore15ChannelPipelineC21SynchronousOperationsV",
            "targetFallback": "NIOCore.ChannelPipeline.SynchronousOperations"
        }
    ],

B:

    "relationships": [
        {
            "kind": "memberOf",
            "source": "s:7NIOCore15ChannelPipelineC21SynchronousOperationsV6NIOSSLE17nioSSL_tlsVersionAF10TLSVersionOSgyKF",
            "target": "s:7NIOCore15ChannelPipelineC21SynchronousOperationsV",
            "targetFallback": "NIOCore.ChannelPipeline.SynchronousOperations"
        },
        {
            "kind": "memberOf",
            "source": "s:7NIOCore7ChannelP6NIOSSLE17nioSSL_tlsVersionAA15EventLoopFutureCyAD10TLSVersionOSgGyF",
            "target": "s:7NIOCore7ChannelP",
            "targetFallback": "NIOCore.Channel"
        }
    ],

symbolgraph data is quite complex, so i would expect there is similar behavior with other nested structures, like platform availability lists.
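One way to work around the edge ordering in a postprocessing step is to sort the relationships list before comparing or committing. The sketch below assumes that sorting by (kind, source, target) is an acceptable ordering. That choice is an assumption made here, not an order the symbol graph format defines, and the names `relationship_key` and `normalize` are hypothetical. The two edges are the ones from samples A and B above.

```python
import json


def relationship_key(edge: dict) -> tuple:
    # Assumed ordering: by kind, then source, then target.
    return (edge.get("kind", ""), edge.get("source", ""), edge.get("target", ""))


def normalize(graph: dict) -> str:
    """Sort the relationships list, then serialize with sorted keys."""
    ordered = dict(graph)  # shallow copy so the input is not modified
    ordered["relationships"] = sorted(graph.get("relationships", []), key=relationship_key)
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


edge_channel = {
    "kind": "memberOf",
    "source": "s:7NIOCore7ChannelP6NIOSSLE17nioSSL_tlsVersionAA15EventLoopFutureCyAD10TLSVersionOSgGyF",
    "target": "s:7NIOCore7ChannelP",
    "targetFallback": "NIOCore.Channel",
}
edge_sync = {
    "kind": "memberOf",
    "source": "s:7NIOCore15ChannelPipelineC21SynchronousOperationsV6NIOSSLE17nioSSL_tlsVersionAF10TLSVersionOSgyKF",
    "target": "s:7NIOCore15ChannelPipelineC21SynchronousOperationsV",
    "targetFallback": "NIOCore.ChannelPipeline.SynchronousOperations",
}

a = {"relationships": [edge_channel, edge_sync]}  # order from sample A
b = {"relationships": [edge_sync, edge_channel]}  # order from sample B

print(normalize(a) == normalize(b))
```

Only lists known to be unordered should be sorted this way. Blindly sorting every array would be wrong for lists whose order carries meaning, such as declaration fragments.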

I filed an issue on this a while back because it also affects how the symbol graph code is tested in the compiler. The symbol graph generator needs to define a deterministic ordering for symbols and relationships and then sort them before rendering the JSON. That sorting could hurt performance for larger symbol graphs, so we may want to measure the impact before landing a change like that.
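To get a rough feel for what such an ordering and measurement might look like, here is a small Python sketch. It is not the compiler's implementation, and the timing says nothing about the cost inside the compiler itself. It assumes symbols can be ordered by their precise identifier and relationships by (kind, source, target). The synthetic graph and all identifiers in it are made up.

```python
import random
import time


def symbol_key(symbol: dict) -> str:
    # Assumption: the precise identifier is unique per symbol.
    return symbol.get("identifier", {}).get("precise", "")


def relationship_key(edge: dict) -> tuple:
    return (edge.get("kind", ""), edge.get("source", ""), edge.get("target", ""))


def sort_graph(graph: dict) -> dict:
    """Return a copy of the graph with symbols and relationships sorted."""
    ordered = dict(graph)
    ordered["symbols"] = sorted(graph.get("symbols", []), key=symbol_key)
    ordered["relationships"] = sorted(graph.get("relationships", []), key=relationship_key)
    return ordered


def synthetic_graph(count: int, seed: int = 0) -> dict:
    """Build a shuffled fake graph; the identifiers are not real USRs."""
    rng = random.Random(seed)
    ids = [f"s:Fake{i}" for i in range(count)]
    symbols = [{"identifier": {"precise": usr}} for usr in ids]
    relationships = [
        {"kind": "memberOf", "source": usr, "target": "s:FakeParent"} for usr in ids
    ]
    rng.shuffle(symbols)
    rng.shuffle(relationships)
    return {"symbols": symbols, "relationships": relationships}


graph = synthetic_graph(100_000)
start = time.perf_counter()
sorted_graph = sort_graph(graph)
elapsed = time.perf_counter() - start
print(f"sorted {len(graph['symbols'])} symbols in {elapsed:.3f}s")
```

Because the sort keys are total over the fake data, two graphs with the same content but different shuffles come out identical after `sort_graph`, which is the property a deterministic emitter would need.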