Ideas for Better Integrating Swift Packages with LLMs

Hi everyone,

I’m exploring ways to make Swift packages more LLM-friendly and would love to get your input.

As a library author, I’m looking for ways to make it easier for others to learn, teach, and use my library, possibly by providing something that LLMs can consume or exposing metadata that helps developers better understand how to work with it.

Do you think tools like swift-doc should support generating a special file (e.g., llm.txt) tailored for use with LLMs? Are there best practices or ideas you’ve come across that could help improve how our Swift packages are consumed or understood by LLMs?

Looking forward to hearing your thoughts!

Thank you in advance,

Hey Jeffrey,

I've wondered the same thing - and in case you hadn't heard of it, there's a nifty online tool that scrapes out existing content into such a file at https://llm.codes (from @steipete). Pretty sure the whole of that is also available as open source on GitHub.

From a technical perspective, it's certainly possible to render a DocC archive out into a variety of forms - one of the benefits of its current structure being that the archive and the presentation of the archive are relatively separate. There's nothing really stopping you from iterating through the JSON files that make up the archive, dumping them out and rendering them into markdown, or selectively assembling and compressing them into a more compact format to preserve a token budget you might have. Tutorials would be notably more challenging, as they're far more dynamic things, so capturing tutorial content in another form would probably be quite a bit more effort.
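As a rough sketch of that idea (not taken from DocC or llm.codes), the Python below walks an archive's data directory and condenses each render JSON page to a title and abstract. The field names `metadata.title` and `abstract` (a list of inline items with `type` and `text`) are assumptions about the render JSON layout, and the function names are hypothetical; check them against a real archive before relying on this.

```python
import json
from pathlib import Path


def inline_text(fragments):
    """Join the plain-text items of an inline content list.

    Assumption: each item is a dict with a "type" key, and text items carry
    their content under "text". Other inline kinds are skipped here.
    """
    return "".join(
        item.get("text", "") for item in fragments if item.get("type") == "text"
    )


def summarize_page(page):
    """Condense one parsed render JSON page into a short markdown section."""
    title = page.get("metadata", {}).get("title", "Untitled")
    abstract = inline_text(page.get("abstract", []))
    return f"## {title}\n\n{abstract}\n"


def summarize_archive(archive_path):
    """Walk <archive>/data and concatenate a summary for every JSON page."""
    data_dir = Path(archive_path) / "data"
    sections = []
    for json_file in sorted(data_dir.rglob("*.json")):
        with json_file.open(encoding="utf-8") as handle:
            sections.append(summarize_page(json.load(handle)))
    return "\n".join(sections)
```

Keeping only the title and abstract is one way to stay inside a token budget; a fuller converter would also need to handle declarations, discussion sections, and references.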

That said, I personally don't have much insight on what it means to select specific text or examples, or how to present it - or even author for it - to make it more easily consumable. I've seen some basic prompts, but most of the more effective things I've seen seem to lean heavily into annotated code examples, which - while better than it used to be - still isn't super common for library authors to assemble, outside of a quick getting started sort of thing for a README.

Joseph - thanks for sharing llm.codes! I'd love to see better llms.txt support in docc, since running that server costs me 5k/year.

swift-doc should absolutely be able to generate an llms.txt file, optionally also one that just includes code samples - often that's enough for agents. I'd make that a parameter, much like I have an option for it there.

If the Swift Package Index were to generate this kind of LLM-friendly file, would it actually help, or would it just create more noise?

I also considered MCP as a way to give information to LLMs, but maybe a static documentation file will be enough? I'm not enough of an expert in how to feed information to LLMs to know for sure, which is why I'm asking the broader audience.

I suspect using specific examples with documentation is more effective than just regenerating what’s already there in a different file. For what it’s worth, a condensed guide with tips and commonly asked questions is likely to benefit both agents and humans alike :thought_balloon:

Most importantly, documentation examples should actually compile. So making more use of snippets in docc would be beneficial both for keeping docs up to date and for LLMs learning code that actually compiles :eyes:

Hi Jeffrey,

I have been investigating how well LLMs interact with docc-generated documentation and am working on a proposal to improve this, by adding an option to export a plain markdown version of the page alongside the JSON. This will be a more accessible format for models and should open up a lot of downstream processing opportunities. I will add a link to the proposal when it is ready.

Update: Issue created: "LLM-accessible output" (Issue #1301, swiftlang/swift-docc on GitHub).

PR: "Experimental markdown output" by jrturton (Pull Request #1303, swiftlang/swift-docc on GitHub).

Super interesting Richard, thank you for the links. I'd like to ask some questions about your LLM output proposal:

Since this seems to be oriented towards interop/consumption from a DocC archive outside of the current single-page JavaScript app setup, how - and more specifically where - is that data exposed, and where as an external developer can I find a description of it?

I saw in the "Experimental markdown output" pull request (#1303, swiftlang/swift-docc) that the idea is to provide two experimental CLI options, and if both are enabled, DocC will create an index for the markdown content, as well as the markdown content itself. What's the expected location of this index, and where are the various markdown files located inside the archive once they're generated?

The proposal listed a bit of metadata for each Markdown file - I'm guessing that means that each symbol, tutorial section, or article will have its own markdown file. Is there any detailed description of what this metadata manifest includes, or how it might be allowed or expected to expand or change? (I'm wondering if there's a JSON data structure describing it anywhere that can be used by 3rd party developers wanting to consume this information). Nominally the DocC JSON content that DocC render consumes is documented using an OpenAPI JSON spec format, although as far as I can tell this output isn't currently verified for consistency or likely consumed.

Arguably the only thing consuming the raw JSON data in the archive is DocC-Render, but what you're proposing seems to be a bit more intentionally external - so I'd like to request that the metadata format be documented and its output format verified with tests, along with the location of the index and its format, if generated. I'd like to make sure that anything this generates is well understood and easily consumed by 3rd party tools.

I'm very curious how this impacts the size - both the number of files generated, and the on-disk space consumed - when the additional experimental options are enabled. Have you run the output and compared the sizes with any public repositories?

As you were considering options for how to enable this, did you see any path to where the LLM markdown content could be made available as needed, without relying on the already-heavy web-scraping technique that's so common?

I'm specifically thinking towards an agentic use case where Swift package sources are already checked out and available for a project - could the LLM-focused Markdown content be made available locally - rendered "on the fly" as it were when or after packages are resolved through the dependency chains - so that there didn't need to be any HTTP traffic to documentation hosting providers?

Is there a path where docc could convert only to the LLM markdown content, creating it for direct local consumption? Or use the docc process-archive to generate the LLM markdown content?

While I haven't run the code you put up for PR yet, I'm guessing that any Snippets will be rendered as code blocks inside the markdown. Especially since those are likely to be extremely valuable LLM feedstock, are there any additional things that authors can do to highlight common use cases within them? Currently, you can have a bunch of snippets without any of them rendering in DocC - they only render when/where you reference them. Is there any option or possibility to include more common snippets that are meant to be consumed as LLM feedstock, with simple directions to provide some narrative context alongside the code or in the code?

Thanks for the thoughts, Joe! I’ll try and address your points in turn:

Since this seems to be oriented towards interop/consumption from a DocC archive outside of the current single-page JavaScript app setup, how - and more specifically where - is that data exposed, and where as an external developer can I find a description of it?

The structure of each output file (the markdown file, and the manifest) will be expressed in a separate target within DocC. External developers who want to use the structures can depend on just this target, though the markdown files will be valid markdown and readable by anything.

I saw in the "Experimental markdown output" pull request (#1303, swiftlang/swift-docc) that the idea is to provide two experimental CLI options, and if both are enabled, DocC will create an index for the markdown content, as well as the markdown content itself. What's the expected location of this index, and where are the various markdown files located inside the archive once they're generated?

Each converted page will be exported as a markdown file in the DocC archive, alongside (and named identically to, apart from the extension) the render JSON file. If enabled, the manifest file will be written to the root of the archive.
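To make that layout concrete, here is a small Python sketch that maps a render JSON path to its expected markdown sibling and lists pages with no markdown counterpart. The `.md` extension and the helper names are assumptions for illustration; the thread only states that the files sit side by side and differ in extension.

```python
from pathlib import Path


def markdown_path_for(render_json_path):
    """Return the sibling path with the (assumed) .md extension."""
    return Path(render_json_path).with_suffix(".md")


def pages_missing_markdown(archive_path):
    """List render JSON files under <archive>/data with no markdown sibling."""
    data_dir = Path(archive_path) / "data"
    return [
        json_file
        for json_file in sorted(data_dir.rglob("*.json"))
        if not markdown_path_for(json_file).exists()
    ]
```

Run against an archive built with the experimental flag, `pages_missing_markdown` would show which pages were skipped during conversion.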

Is there any detailed description of what this metadata manifest includes, or how it might be allowed or expected to expand or change?

The detail of the metadata will be defined in the separate target mentioned above. These types are marked as SPI in the PR as this is an experimental feature.

I'm very curious how this impacts the size - both the number of files generated, and the on-disk space consumed - when the additional experimental options are enabled. Have you run the output and compared the sizes with any public repositories?

In terms of number of files, there is a rough doubling (most pages get a markdown equivalent). The markdown version of a page is roughly 10% of the size of the equivalent render JSON.

As you were considering options for how to enable this, did you see any path to where the LLM markdown content could be made available as needed, without relying on the already-heavy web-scraping technique that's so common?

This is one of the major motivations of this update. The documentation archive can be created locally with the flag enabled, and then the markdown is available on disk for whatever downstream processing your heart desires.

I'm guessing that any Snippets will be rendered as code blocks inside the markdown. Especially since those are likely to be extremely valuable LLM feedstock, are there any additional things that authors can do to highlight common use cases within them? Currently, you can have a bunch of snippets without any of them rendering in DocC - they only render when/where you reference them. Is there any option or possibility to include more common snippets that are meant to be consumed as LLM feedstock, with simple directions to provide some narrative context alongside the code or in the code?

Yes, snippets will be rendered as code blocks inside the markdown version of the page that represents them. There isn’t anything in this update to include unreferenced snippets, and I wouldn’t like to include any sort of editorial judgement in the process - the output should be a reflection of the written documentation. I would suggest that framework authors highlight important snippets in how-to documentation articles and use downstream processing to select which areas of the documentation they present to LLMs.
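One possible shape for that downstream processing, sketched in Python: pull only the code blocks out of a generated markdown page. This assumes the snippets appear as triple-backtick fenced blocks, which the PR may or may not guarantee; `extract_code_blocks` is a hypothetical helper, not part of DocC.

```python
FENCE = "`" * 3  # a triple-backtick fence, built this way to keep the example readable


def extract_code_blocks(markdown):
    """Return the body of every fenced code block, in document order."""
    blocks = []
    current = None  # None while outside a fence, a list of lines while inside
    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            if current is None:
                current = []  # opening fence: start collecting
            else:
                blocks.append("\n".join(current))  # closing fence: store block
                current = None
        elif current is not None:
            current.append(line)
    return blocks
```

A code-samples-only file could then be assembled by running this over each page an author chooses to highlight.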

I have two questions about this:

  • You say that most pages would get a markdown file. What pages wouldn't get one and why?
  • What data is the 10% number based on? I tried running your branch with both the experimental flags on SwiftSyntax and SwiftNIO and the size increase that I observed was 31% and 38% respectively.

There are pages that are not converted to markdown because there isn’t much merit to doing so, for example, a tutorial table of contents.

The 10% figure was based on a very unscientific eyeballing of side-by-side files after output of various frameworks. I also mistyped, apologies: I meant to include a range, 10-30%. I’ve confirmed this slightly more scientifically by running against a larger framework, which generates ~100MB of render JSON and ~20MB of markdown.

The percentage figure is also a reflection of the difference between a given render JSON page and its markdown equivalent. If you include the manifest as well, this will further increase the size, but the size of the manifest is very variable depending on the number of relationships expressed in the documentation. The test framework I was looking at had ~9,000 relationships and ~5,300 document pages, and the manifest was about 3MB. Note that in debug builds, the manifest JSON is pretty-printed as well, which will increase the size.
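For anyone wanting to reproduce a markdown-to-JSON ratio like this, a minimal Python sketch follows. It sums file lengths by extension under a directory; the `.md` extension and the function names are assumptions, and file lengths are not the same as on-disk usage, which also depends on filesystem block size.

```python
from pathlib import Path


def bytes_by_suffix(root):
    """Sum file sizes under root, grouped by file extension."""
    totals = {}
    for entry in Path(root).rglob("*"):
        if entry.is_file():
            totals[entry.suffix] = totals.get(entry.suffix, 0) + entry.stat().st_size
    return totals


def markdown_to_json_ratio(root):
    """Return total markdown bytes divided by total render JSON bytes."""
    totals = bytes_by_suffix(root)
    json_bytes = totals.get(".json", 0)
    if json_bytes == 0:
        raise ValueError("no JSON files found under " + str(root))
    return totals.get(".md", 0) / json_bytes
```

Pointing this at the data directory keeps a root-level manifest out of the count; pointing it at the archive root would fold any JSON manifest into the JSON total.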

I'd prefer it if we could get more scientific and reproducible measurements for this. Large archive sizes have been a highly discussed topic in the past and any double-digit regression shouldn't be taken too lightly.

If we ignore the index.html files that DocC emits by default and focus on the "data" directory where both the JSON and Markdown files are emitted, then my method for comparing size increase would be:

  • make a release build of docc from your branch (that supports the new flags)
  • generate symbol graph files for each module
  • build each module's documentation without the new markdown files using
    .build/release/docc convert \
      /path/to/TheCatalog.docc \
      --additional-symbol-graph-dir /path/to/ExtractedSymbolGraphDirForModule \
      --output-path ModuleName-before.doccarchive
    
  • build each module's documentation with the new markdown files using
    .build/release/docc convert \
      /path/to/TheCatalog.docc \
      --additional-symbol-graph-dir /path/to/ExtractedSymbolGraphDirForModule \
      --enable-experimental-markdown-output \
      --enable-experimental-markdown-output-manifest \
      --output-path ModuleName-after.doccarchive
    
  • Determine the sizes of only the data subdirectories using
    du -hs ModuleName-before.doccarchive/data
    
    and
    du -hs ModuleName-after.doccarchive/data
    

For SwiftSyntax (44k symbols) this showed a 381 MB data directory before and a 557 MB data directory after which is a 46% increase. For SwiftNIO (3.3k pages) this showed a 21 MB data directory before and a 34 MB data directory after which is a 62% increase.

I ran the same thing with SwiftDocC (5.2k pages) just now and it showed a 35 MB data directory before and 56 MB data directory after which is a 60% increase.
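The percentages above follow directly from the before and after sizes. A short Python check, using only the figures quoted in this post (the function name is just for illustration):

```python
def percent_increase(before, after):
    """Relative growth from before to after, as a percentage."""
    return (after - before) / before * 100


# Data directory sizes in MB, as quoted above: (before, after)
measurements = {
    "SwiftSyntax": (381, 557),
    "SwiftNIO": (21, 34),
    "SwiftDocC": (35, 56),
}

for name, (before, after) in measurements.items():
    print(f"{name}: {percent_increase(before, after):.0f}% increase")
# prints:
# SwiftSyntax: 46% increase
# SwiftNIO: 62% increase
# SwiftDocC: 60% increase
```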

Unless I'm measuring this completely wrong, that is a very big size impact.


If my method (above) is good and its measurements representative, it would indicate that the future direction of exporting one Markdown file per programming language variant (for example Swift and Objective-C) could potentially result in the size increasing by as much as 120% from tripling the number of files in the archive.
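The arithmetic behind that projection, as a back-of-the-envelope Python sketch. It assumes each language variant's markdown set costs about the same as the single set measured above (roughly 60% of the baseline), which is a simplification rather than a measured result.

```python
def project(markdown_variants, increase_per_variant_percent=60):
    """Project file-count multiplier and size increase for N markdown variants.

    Assumption: every variant adds one markdown file per page and the same
    relative size as the single-variant measurement.
    """
    file_multiplier = 1 + markdown_variants  # one JSON file plus N markdown files
    size_increase_percent = markdown_variants * increase_per_variant_percent
    return file_multiplier, size_increase_percent


print(project(1))  # (2, 60): today's experimental output
print(project(2))  # (3, 120): Swift and Objective-C variants
```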

Some clarification on the aims and the limits of this proposal and the current PR:

The only intention of the proposal is to create on-disk, human- and machine-readable outputs from the DocC compilation process. The proposal will not, on its own, affect documentation HTML pages generated by DocC or make them readable by LLMs. It is deliberately agnostic and non-prescriptive about the downstream uses that the markdown and manifest files can be put to. DocC users should only enable these flags if they have a specific use case for the additional files.

What the proposal is intended to enable is workflows that require the outputs of DocC in a more accessible format, including, but not limited to:

  • Building custom Retrieval-augmented generation (RAG) systems or knowledge graphs for specific sets of documentation.

  • Passing the markdown content directly to an LLM or other tooling for summary generation, coding assistance or analysis tasks such as tone or language consistency.

  • Using custom post-processing to incorporate and serve the markdown files to website visitors.

Outside the immediate scope of the proposal, but a valuable follow-on PR, would be a change to the static site generation process in DocC to include an alternate link to the generated markdown in each page’s head:

<link rel="alternate" type="text/markdown" href="https://example.com/symbolname.md">

This would then allow an LLM visiting a given DocC-produced documentation URL, without the ability to run JavaScript, to obtain the page content in a readable form.
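On the consuming side, a client could discover that link without executing any scripts. A minimal Python sketch using the standard library's `html.parser`; the class and function names are hypothetical, and nothing here is part of DocC or the proposal.

```python
from html.parser import HTMLParser


class AlternateMarkdownFinder(HTMLParser):
    """Record the href of the first <link rel="alternate" type="text/markdown">."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attributes = dict(attrs)
        # rel is a space-separated token list, so check membership per token
        rel_tokens = (attributes.get("rel") or "").lower().split()
        link_type = (attributes.get("type") or "").lower()
        if "alternate" in rel_tokens and link_type == "text/markdown":
            self.href = attributes.get("href")


def find_markdown_alternate(html):
    """Return the markdown alternate URL from an HTML document, or None."""
    finder = AlternateMarkdownFinder()
    finder.feed(html)
    return finder.href
```

A crawler or agent would fetch the page once, call `find_markdown_alternate` on the response body, and then request the returned URL instead of trying to render the page.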
