Emitting a JSON representation of the DocC navigator index

Hi all!

I'd like to pitch an enhancement to Swift-DocC in support of the new navigation sidebar that is being developed in Swift-DocC-Render.

You can see more information about the in-development Swift-DocC-Render navigation sidebar in these threads:

This pitch is to enable emitting a JSON representation of Swift-DocC's navigator index (called RenderIndex JSON) by default. This will allow for immediate support of the Swift-DocC-Render navigation sidebar and, moving forward, easier adoption of navigator index information by other clients.

Overview

Swift-DocC already supports emitting an index of the navigation hierarchy of a DocC archive during the conversion process. This index is emitted as an LMDB database and only if explicitly requested via the --index flag to docc convert.

In practice, the --index flag is almost always passed, since both Swift-DocC's integration with Xcode and its integration with SwiftPM via the Swift-DocC Plugin include the --index flag.

The current LMDB representation of the navigator index was designed for Swift-DocC's integration into documentation browsing applications and IDEs. It was also designed to support extremely large DocC archives. However, it is not well suited for consumption on the web, and the average-sized DocC archive we see in practice doesn't benefit much from the potential performance improvements of an LMDB representation.

The JSON representation will fill in this gap to immediately allow Swift-DocC-Render to display a navigation sidebar and, moving forward, allow for easier adoption of any clients that don't wish to use LMDB.

The change discussed here would mean that while docc convert continues to only emit the LMDB index if the --index flag (now --emit-lmdb-index) is passed, it will now always emit the JSON representation of the index so that clients can rely on it being there.

Links & Technical Details

In designing the RenderIndex schema, I prioritized readability as, in practice, we found that the prevalence and effectiveness of gzip compression on the web means that we don't need to make the JSON as compact as possible at the cost of readability. This also fits in well with our existing RenderNode schema.
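To illustrate how a client might consume a readable tree like this, here is a minimal TypeScript sketch. The node fields (`title`, `path`, `children`) and the sample data are assumptions made for illustration and are not taken from the RenderIndex specification; only the `interfaceLanguages` top-level key and the `swift` language key come from the discussion in this thread.

```typescript
// Hypothetical node shape; the real RenderIndex spec may differ.
interface IndexNode {
  title: string;
  path?: string; // assumed optional, e.g. for group markers
  children?: IndexNode[];
}

// Hypothetical top-level shape keyed by source language.
interface RenderIndexSketch {
  interfaceLanguages: Record<string, IndexNode[]>;
}

// Depth-first walk that collects every path in one language's tree.
function collectPaths(nodes: IndexNode[]): string[] {
  const paths: string[] = [];
  for (const node of nodes) {
    if (node.path !== undefined) paths.push(node.path);
    if (node.children) paths.push(...collectPaths(node.children));
  }
  return paths;
}

// Made-up sample data for a fictional "MyKit" module.
const sampleIndex: RenderIndexSketch = {
  interfaceLanguages: {
    swift: [
      {
        title: "MyKit",
        path: "/documentation/mykit",
        children: [
          { title: "Classes" }, // group marker without a path (assumed)
          { title: "Widget", path: "/documentation/mykit/widget" },
          { title: "Gadget", path: "/documentation/mykit/gadget" },
        ],
      },
    ],
  },
};

const swiftPaths = collectPaths(sampleIndex.interfaceLanguages["swift"] ?? []);
```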

Note that the PR for feature enablement also bumps the minor version of the RenderNode schema to 0.3.0 from 0.2.0. This will allow clients, including Swift-DocC-Render, to detect whether a DocC archive is likely to include RenderIndex JSON, which eases the migration period in which many existing DocC archives do not contain the index. Without this, clients like Swift-DocC-Render would have to offer a sub-optimal UX when rendering older DocC archives, where the page might flicker while making a request to see if the RenderIndex JSON exists.
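A client-side check along these lines could look like the following TypeScript sketch. The `SchemaVersion` shape with `major`, `minor`, and `patch` fields and the function names are assumptions for illustration; only the 0.3.0 threshold comes from the pitch.

```typescript
// Assumed shape of a schema version value.
interface SchemaVersion {
  major: number;
  minor: number;
  patch: number;
}

// First RenderNode schema version expected alongside RenderIndex JSON (per the pitch).
const firstVersionWithIndex: SchemaVersion = { major: 0, minor: 3, patch: 0 };

// Compares two versions component by component.
function isAtLeast(version: SchemaVersion, minimum: SchemaVersion): boolean {
  if (version.major !== minimum.major) return version.major > minimum.major;
  if (version.minor !== minimum.minor) return version.minor > minimum.minor;
  return version.patch >= minimum.patch;
}

// Hypothetical helper: decide whether it is worth requesting the index at all.
function likelyHasRenderIndex(version: SchemaVersion): boolean {
  return isAtLeast(version, firstVersionWithIndex);
}
```

With a check like this, a renderer could skip the index request entirely for older archives instead of probing for the file and re-laying out the page afterwards.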

Future Directions

Moving forward, I'd like to make the JSON representation of the navigator index fully interchangeable with the LMDB one. With this initial release, the JSON version doesn't hold all of the availability information of the LMDB version, so we need to keep both.

But in the future, I think we can teach the SwiftDocC framework to read either a JSON representation of the navigator index or an LMDB one and make this totally transparent to clients of SwiftDocC. This will allow us to migrate the majority of DocC archives over to only having a JSON representation of the navigator index and leave the LMDB representation as opt-in for those with unusually large DocC archives where the speed benefits of the database representation are really worth it.

Overall sounds good to me! An easily-digestible curated tree of paths will make it much easier to build consumption and authoring tools.

I wonder if using interfaceLanguages as the top level container will become too restrictive in the future. Perhaps instead the top level key could be something like variants and each variant in the array could mirror RenderNode’s RenderNodeVariant schema.

If you decide to continue with the current approach:

  • I still suggest revisiting the name of the top level key; it’s a bit confusing because the primary thing you want out of it is the tree, not the language.
  • If possible, avoid using dynamic key names like swift and instead use an array of objects. Dynamic keys are a bit awkward with Codable and don’t leave you with any room to revisit the variants approach without breaking existing clients.

Regarding file size: I’m curious how large the index file is (raw and gzipped) for very large trees like the standard library or SwiftSyntax. Also, do you have any concerns about peak memory usage for the deserialized JavaScript objects?

Do you envision that the web navigator will work like the doc viewer, where there are separate navigator and on-page language pickers and you can independently browse the Objective-C tree while viewing a Swift page? If not, it might make sense to emit per-variant index files instead. I’d expect multiple variants to compress well but you’d still end up with largely duplicate trees in-memory.

Thanks for the writeup @ethankusters, this is looking really great. I'm excited for more clients to be able to leverage navigator data.

Curious about enabling the JSON index generation by default—have you measured performance implications?

Also, how does this interplay with DocC's existing --index flag which emits the LMDB index? Should we rename it to something more explicit such as --emit-lmdb-index?

I think the implication here is that we would rename it.

Right, and I see in the PR that that's the case. It would be nice to also emit a warning describing the rename when docc is invoked with --index.

I think this looks great! The LMDB representation currently relies on a separate plist file to store some of the availability information (availability index). I think we need a plan for how we are going to store this in the JSON index.

Thanks for all this great feedback @jack! (Also, just now fully realizing that you have the @jack handle here which is very cool. :smiley:)

I think this is a good concern.

To give a little background: In general, I wasn't trying to invent anything really new here with the basic tree structure. It's an exact mirror of how the tree in the LMDB representation works. I think there's some clear benefit here to initially keeping them the same and then improving them both together in the future since I think long-term we want them to be interoperable.

So that being said, I'd be hesitant to use the general variant schema for the top-level item in the navigator index because I think that invites too much variation. We shouldn't be creating entirely different navigator tree structures for most variants. It's really just an unfortunate necessity of documenting different source languages, where you can have an entirely different AST for each language.

There are other conceivable cases where you would want entirely different tree structures (one based purely on symbol type/hierarchy and not on custom curation has been suggested on the forums a few times) but I don't think we should treat that as a RenderNodeVariant, it should be a different concept.

For example, in a future world where we support localization, I think we would want to introduce a variant schema to the RenderIndex, but it would apply to the existing base trees, not introduce entirely new trees.

Hm I agree it's not great. Any suggestions on alternatives?

Is this close to what you're suggesting?

{
  "navigators": [
    {
      "identifier": "swift",
      "displayName": "Swift",
      "nodes": [
        ...
      ]
    },
    {
      "identifier": "occ",
      "displayName": "Objective-C",
      "nodes": [
        ...
      ]
    }
  ]
}

I agree this is a more flexible and understandable schema. I think I would prefer to leave the overall structure as is though to keep it in sync with the LMDB structure and look to adopting this in a future version of the schema once we've achieved interoperability with the LMDB navigator.
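For clients that want the array shape today, a small adapter could bridge the two forms. The following TypeScript sketch assumes a dynamic-key input under `interfaceLanguages` and a hypothetical `displayNames` lookup table; the type names and the node fields are illustrative and not part of any spec.

```typescript
// Hypothetical minimal node shape.
interface TreeNode {
  title: string;
  children?: TreeNode[];
}

// Current form discussed in the thread: language identifiers as dynamic keys.
interface DynamicKeyIndex {
  interfaceLanguages: Record<string, TreeNode[]>;
}

// Suggested form: an array of self-describing navigator objects.
interface Navigator {
  identifier: string;
  displayName: string;
  nodes: TreeNode[];
}

interface ArrayIndex {
  navigators: Navigator[];
}

// Assumed lookup table; unknown identifiers fall back to the identifier itself.
const displayNames: Record<string, string> = {
  swift: "Swift",
  occ: "Objective-C",
};

// Converts the dynamic-key form into the array form.
function toArrayForm(index: DynamicKeyIndex): ArrayIndex {
  const navigators = Object.entries(index.interfaceLanguages).map(
    ([identifier, nodes]) => ({
      identifier,
      displayName: displayNames[identifier] ?? identifier,
      nodes,
    })
  );
  return { navigators };
}
```

Because the array form carries the identifier as data rather than as a key, a client written against it would not need to change when a new language is added.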

The standard library is 147 KB gzipped and 2.1 MB raw.

@marcus_ortiz do you have any thoughts here?

The current implementation of the navigator in Swift-DocC-Render requires you to switch your current page to Objective-C to see the Objective-C navigator so I think your suggestion is applicable here.

However, because of the RenderNodeVariant schema, when the user switches to Objective-C, the site doesn't need to fetch any additional data to render the Objective-C content and the change is instant. So I think it might create a worse UX if you had to wait to download the Objective-C navigator before displaying it.

Definitely still worth considering moving forward but it seems like a break from the design decisions we arrived at for the RenderNode schema.

That makes sense. I think we should hold off on a warning for now, though, because some clients of docc (primarily thinking of the Swift-DocC Plugin) can't guarantee which version of docc they're going to run, since that just depends on which version is in the Swift toolchain.

So those clients will need to continue to use --index for at least some time.

But we should definitely progressively move towards a warning and then removing it entirely.

We are using a tool for recycling DOM nodes when rendering the sidebar itself, so that we only render nodes that are visible at any given time. This avoids major visual issues when working with this large amount of data on screen.
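The core idea behind rendering only visible nodes can be sketched as a window calculation over a flattened list of rows. This is not Swift-DocC-Render's actual implementation; the function name, the fixed row height, and the `overscan` parameter are all assumptions for illustration.

```typescript
// Half-open range of row indices to render: [start, end).
interface VisibleRange {
  start: number;
  end: number;
}

// Computes which rows intersect the viewport, plus a few extra rows
// (overscan) on each side so fast scrolling does not show blank gaps.
// Assumes every row has the same fixed height.
function visibleRange(
  scrollTop: number,
  viewportHeight: number,
  rowHeight: number,
  totalRows: number,
  overscan: number = 2
): VisibleRange {
  const first = Math.floor(scrollTop / rowHeight);
  const lastExclusive = Math.ceil((scrollTop + viewportHeight) / rowHeight);
  return {
    start: Math.max(0, first - overscan),
    end: Math.min(totalRows, lastExclusive + overscan),
  };
}

// Example: a 600px viewport over 40,000 rows of 32px each.
const range = visibleRange(1000, 600, 32, 40000);
```

Only the rows in the returned range need DOM nodes, so the number of rendered elements stays roughly constant regardless of how large the index is.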

We haven't done any profiling of memory usage specifically for this new UI and the size of the JavaScript representation of the deserialized index data yet. I think that would be a great thing to look at for large trees like the Swift Standard Library.

Agreed! I've been thinking about it a bit and it's currently not clear to me if it makes more sense to integrate that information into the primary tree or provide additional JSON files that would map documentation paths to availability information.

In practice, the majority of symbols don't have specific availability information; they just fall back to the default availability of the framework they're a part of. So my hunch is that we can get away with integrating this information into the main tree, since we would only need it for a subset of the items. Then we would add an additional property to the top-level schema that would provide default availability.
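As a rough sketch of that fallback idea in TypeScript: the property names (`availability`, `defaultAvailability`) and the `Availability` shape below are hypothetical, since nothing about this part of the schema has been decided.

```typescript
// Hypothetical availability record.
interface Availability {
  platform: string;
  introduced: string;
}

// Hypothetical node: availability is only present when it differs from the default.
interface AvailabilityNode {
  title: string;
  availability?: Availability[];
  children?: AvailabilityNode[];
}

// Hypothetical top-level schema with a default availability property.
interface IndexWithDefaults {
  defaultAvailability: Availability[];
  nodes: AvailabilityNode[];
}

// Returns the node's own availability if present, otherwise the default.
function resolveAvailability(
  node: AvailabilityNode,
  index: IndexWithDefaults
): Availability[] {
  return node.availability ?? index.defaultAvailability;
}
```

Under this design, only the subset of symbols with specific availability would add any bytes to the tree.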

I think this is generally out-of-scope for this particular pitch though but I'm looking forward to sorting this out so we have full interoperability with the LMDB representation.

the standard library will not be your problem, your problem will be heavily-mechanized libraries like SwiftSyntax which suffer from quadratic explosion of protocol extension members × protocol conformers.

optimizing this in order to serve documentation in a sound manner (e.g., not accidentally making non-existent specializations accessible) is non-trivial. making the generated URLs stable is an even more involved problem and there are a huge number of edge cases to account for.

a word of advice from someone who has been working on this for years: as you scale up your tool to handle larger and larger symbol graphs, you will find yourself effectively re-implementing large parts of the typechecker. the naïve approach of precomputing the module interfaces into a list of symbols simply does not scale.

Agreed that we should definitely do some serious testing with SwiftSyntax! I went ahead and generated a RenderIndex JSON file for it here: Swift Syntax DocC RenderIndex · GitHub.

It's 4.5 MB raw and 294 KB gzip compressed.

I think this is a general issue with Swift-DocC's navigator index generation, separate from the pitched new JSON format. But I agree it's a challenging problem.

Great question! In general this isn't a huge concern because, in practice, we already perform indexing by default since both of Swift-DocC's integrations with Xcode and the Swift-DocC Plugin pass the --index flag.

However, the data is still interesting and relevant here. I went ahead and collected data for both Swift-Markdown and Swift-Syntax:

| Project | Indexing disabled (s) | LMDB indexing, current default (s) | LMDB+JSON indexing, proposed default (s) | JSON indexing, future default (s) |
|---|---|---|---|---|
| Swift Markdown | 1.01 | 1.10 | 1.10 | 1.06 |
| SwiftSyntax | 23.51 | 26.37 | 27.24 | 25.37 |

Values in seconds show the average time, over three runs, it took docc to perform a full conversion, excluding the time it takes the Swift compiler to emit the symbol graph files.

What it shows is that while we'll take a slight performance hit for taking the time to emit both the LMDB and JSON representations, we'll see a greater performance win when we can fully migrate to JSON.
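To put numbers on that, the relative changes for the SwiftSyntax timings above can be computed as follows. The helper name and object keys are illustrative; the timing values are the ones reported in this post.

```typescript
// Percentage change of `value` relative to `baseline`.
function percentChange(baseline: number, value: number): number {
  return ((value - baseline) / baseline) * 100;
}

// SwiftSyntax conversion times in seconds, as reported above.
const swiftSyntaxSeconds = {
  disabled: 23.51,
  lmdb: 26.37,
  lmdbAndJson: 27.24,
  json: 25.37,
};

// Proposed default (LMDB+JSON) versus current default (LMDB): about +3.3%.
const proposedVsCurrent = percentChange(
  swiftSyntaxSeconds.lmdb,
  swiftSyntaxSeconds.lmdbAndJson
);

// Future default (JSON only) versus current default (LMDB): about -3.8%.
const futureVsCurrent = percentChange(
  swiftSyntaxSeconds.lmdb,
  swiftSyntaxSeconds.json
);
```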

I think this definitely shows that we should prioritize making the JSON and LMDB fully interoperable in the near future.

Assuming you mean "SwiftPM" here :slight_smile:

Thanks for the data. The way you categorised your findings makes it super clear that this is not a huge concern. I do feel like we should consider trying to offset that slight hit (+3.3%) with another performance optimisation before we release for the next Swift version. Is it worth considering outputting the LMDB and JSON indexes in parallel?

What about in terms of total DocC archive size? Based on the data for SwiftSyntax mentioned above, this seems like a small increase relative to the archive size without the index?

I think that's a great point. Definitely worth considering parallelizing the output here. I went ahead and filed a bug to track that here: [SR-15965] Address slight performance regression caused by emitting JSON navigator index · Issue #175 · apple/swift-docc · GitHub.

Here's the size of the produced DocC archives for the same data set from before:

| Project | Indexing disabled | LMDB indexing (current default) | LMDB+JSON indexing (proposed default) | JSON indexing (future default) |
|---|---|---|---|---|
| Swift Markdown | 10.8 MB | 11.4 MB | 11.6 MB | 11.0 MB |
| SwiftSyntax | 207 MB | 217 MB | 222 MB | 211 MB |

I think the size increase is negligible compared to the overall size of the DocC archive. It is interesting that the JSON representation is currently smaller than the LMDB one. We'll have to see if that holds true once we add in availability information, but it's certainly promising.

You raised some great points, I agree that it's appropriate to scope this to programming language. I was thinking tree and page variants would be 1:1 but clearly that's not the case.

I think navigators or trees or something like that would make sense. If you're not going to initially address the next comment though, I don't know if it's worth changing right now.

Indeed, this is what I'm thinking. Then clients that consume the format wouldn't need to rev when a new language is added, or if DocC decides to organize the docs on some other axis, or if you want to add additional per-tree metadata.

My understanding is that you're pitching this index file itself as a feature of DocC that people can consume in their own tooling, so I'm not sold that the implementation concern you raise is relevant. Changing the format later seems harder once clients have begun consuming the file. That said, we're still in early days of folks creating tooling around DocC—perhaps stability isn't a big concern at this point.

I think these are all really good points and I generally agree. For the initial implementation I focused on achieving parity with the existing LMDB implementation but these are all good things to consider as we look to the future.

I think I'd prefer to land this as-is, since the current format achieves our current needs and goals. Then, when we have a clearer idea of how we'd like to expand features based on the NavigatorIndex, we can take this into account when designing a future revision of the spec. Does that seem reasonable?

I really appreciate you taking the time to provide all of this great feedback.

- Ethan

Sounds like a reasonable approach to me. None of the schemas have reached 1.0 yet, so I don't think we're promising much in the way of doccarchive stability yet anyway.

This proposed change has been merged to main with Emit RenderIndex JSON by default (#100) · apple/swift-docc@d68baeb · GitHub.

- Ethan
