Super interesting Richard, thank you for the links. I'd like to ask some questions about your LLM output proposal:
Since this seems oriented towards interop/consumption from a DocC archive outside of the current single-page JavaScript app setup, how - and more specifically where - is that data exposed, and where can I, as an external developer, find a description of it?
I saw in Experimental markdown output by jrturton · Pull Request #1303 · swiftlang/swift-docc · GitHub that the idea is to provide two experimental CLI options, and if both are enabled, DocC will create an index for the markdown content as well as the markdown content itself. What's the expected location of this index, and where are the markdown files located inside the archive once they're generated?
The proposal listed a bit of metadata for each Markdown file - I'm guessing that means each symbol, tutorial section, or article will have its own markdown file. Is there a detailed description of what this metadata manifest includes, or of how it might be expected to expand or change? (I'm wondering if there's a JSON data structure describing it anywhere that could be used by third-party developers wanting to consume this information.) Nominally, the DocC JSON content that DocC-Render consumes is documented using an OpenAPI JSON spec format, although as far as I can tell that output isn't currently verified for consistency, and likely isn't consumed against it.
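To make the request concrete: this is the kind of per-file metadata record I'd hope to see documented, with a consumer-side check against it. The field names here (`title`, `kind`, `path`) are purely my guesses at what such a manifest might contain, not anything DocC actually emits today:

```python
import json

# Hypothetical fields for one markdown file's metadata record.
# These names are my own placeholders, not DocC's.
REQUIRED_FIELDS = {"title", "kind", "path"}

def validate_record(raw: str) -> dict:
    """Parse one metadata record and check the fields a third-party
    consumer would rely on, failing loudly if any are missing."""
    record = json.loads(raw)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"metadata record missing fields: {sorted(missing)}")
    return record

example = '{"title": "MyClass", "kind": "symbol", "path": "documentation/mykit/myclass.md"}'
record = validate_record(example)
print(record["kind"])  # prints: symbol
```

With a documented (and test-verified) schema, a check like this could be generated from the spec rather than hand-written from guesses.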
Arguably the only thing consuming the raw JSON data in the archive today is DocC-Render, but what you're proposing seems more intentionally external - so I'd like to request that the metadata format be documented and its output verified with tests, along with the location and format of the index, if one is generated. I'd like to make sure that anything this generates is well understood and easily consumed by third-party tools.
I'm very curious how this impacts the size - both the number of files generated and the on-disk space consumed - when the additional experimental options are enabled. Have you generated the output for any public repositories and compared the sizes?
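The comparison I have in mind is simple: walk an archive built with and without the experimental flags and tally file counts and bytes. A minimal sketch (the archive names are placeholders):

```python
import os

def tree_stats(root: str) -> tuple[int, int]:
    """Return (file_count, total_bytes) for everything under root."""
    count, total = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            count += 1
            total += os.path.getsize(os.path.join(dirpath, name))
    return count, total

# Hypothetical usage against two builds of the same catalog:
#   baseline = tree_stats("MyKit.doccarchive")
#   with_md  = tree_stats("MyKit-markdown.doccarchive")
#   print(with_md[0] - baseline[0], "extra files")
```

Running that over a few well-known public packages would give a useful ballpark for the overhead.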
As you were considering options for how to enable this, did you see any path by which the LLM markdown content could be made available as needed, without relying on the already-heavy web-scraping technique that's so common?
I'm specifically thinking of an agentic use case where Swift package sources are already checked out and available for a project - could the LLM-focused Markdown content be made available locally, rendered "on the fly" as it were when or after packages are resolved through the dependency chain, so that there didn't need to be any HTML traffic to documentation hosting providers?
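To sketch what I mean: after a resolve, the inputs for such local rendering are already on disk. Assuming the default SwiftPM layout (a `.build/checkouts` directory, with `.docc` catalogs under each package's `Sources`), a tool could locate them like this - this is my own sketch of the discovery step, not anything DocC or SwiftPM provides:

```python
from pathlib import Path

def find_docc_catalogs(project_root: str) -> list:
    """Locate .docc catalogs inside resolved SwiftPM checkouts - the
    inputs an agent could hand to a local markdown renderer.
    Assumes the default .build/checkouts layout."""
    checkouts = Path(project_root) / ".build" / "checkouts"
    if not checkouts.is_dir():
        return []
    # .docc catalogs are directories, so glob matches directory names too.
    return sorted(checkouts.glob("*/Sources/**/*.docc"))
```

Each discovered catalog could then be converted to the LLM markdown form locally, with no round trip to a documentation host.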
Is there a path where docc could convert only to the LLM markdown content, creating it for direct local consumption? Or could docc process-archive be used to generate the LLM markdown content from an existing archive?
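For clarity, these are the two invocation shapes I'm imagining. The flag and subcommand action names below are placeholders I made up for illustration - the PR presumably defines the real ones:

```python
def markdown_convert_command(catalog: str, output: str) -> list:
    """Hypothetical `docc convert` invocation that emits only the LLM
    markdown content. Both --experimental-* flag names are placeholders."""
    return [
        "docc", "convert", catalog,
        "--output-path", output,
        "--experimental-markdown-output",  # placeholder flag name
        "--experimental-markdown-index",   # placeholder flag name
    ]

def markdown_from_archive_command(archive: str, output: str) -> list:
    """Hypothetical `docc process-archive` invocation deriving markdown
    from an existing archive. The action name is a placeholder."""
    return [
        "docc", "process-archive",
        "emit-markdown",  # placeholder action name
        archive,
        "--output-path", output,
    ]

cmd = markdown_convert_command("MyKit.docc", "MyKit.doccarchive")
```

Either shape would let tooling skip the full render-JSON pipeline when only the markdown is wanted.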
While I haven't run the code you put up for the PR yet, I'm guessing that any snippets will be rendered as code blocks inside the markdown. Since those are likely to be extremely valuable LLM feedstock, is there anything additional that authors can do to highlight common use cases within them? Currently you can have a bunch of snippets without any of them rendering in DocC - they only render when and where you reference them. Is there any option, or possibility, to include more snippets that are meant specifically as LLM feedstock, with simple directions to provide some narrative context alongside the code or within it?
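As one illustration of "narrative context in the code": a tool could lift a snippet's leading comment lines out as prose above the fenced block, so the model sees intent alongside the code. This is my own sketch of such a transform, not how DocC actually renders snippets:

```python
def snippet_to_markdown(title: str, source: str) -> str:
    """Render a snippet source as a markdown section: leading '//'
    comment lines become narrative text, the rest becomes a fenced
    Swift code block. A sketch, not DocC's actual rendering."""
    narrative, code = [], []
    for line in source.splitlines():
        stripped = line.strip()
        if not code and stripped.startswith("//"):
            narrative.append(stripped.lstrip("/ "))
        else:
            code.append(line)
    fence = "`" * 3
    parts = [f"### {title}"]
    if narrative:
        parts.append(" ".join(narrative))
    parts.append(f"{fence}swift\n" + "\n".join(code).strip() + f"\n{fence}")
    return "\n\n".join(parts)

src = (
    "// Creates a client and fetches a value.\n"
    "let client = Client()\n"
    "let value = client.fetch()"
)
md = snippet_to_markdown("Fetching a value", src)
```

If authors had a lightweight convention like this, even never-referenced snippets could carry enough narrative to be useful feedstock on their own.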