Xcode DocC huge archive file size

I'm trying to document my application (not a library), but the documentation archive Xcode produces is unacceptably large: around 200 MB! Does anyone have any idea how I can reduce the file size?

For an uncompressed .doccarchive file, that's not out of line in terms of overall size. I just did a quick build of a little utility SwiftUI app I wrote, and it's coming in at ~114 MB.

App documentation is going to be notably larger than library documentation, because it defaults to documenting internal and public symbols, whereas a library defaults to documenting only public symbols. The majority of the space in that archive is likely its "data" directory, where there's a JSON file for every symbol that's included in the archive. The "documentation" directory can be large as well: it has a placeholder "index.html" file for every JSON file that represents a symbol.

To give you an example with numbers, the app is public on GitHub (heckj/SPISearch, a utility application to capture and review search results from Swift Package Index), and the .doccarchive I created (through Xcode: Product > Build Documentation, then exported) has the following pattern of space consumption (I used du -sh in the terminal to get these numbers):

du -sh *
 52K	css
 73M	data
 16K	developer-og-twitter.jpg
 16K	developer-og.jpg
 31M	documentation
  0B	downloads
 16K	favicon.ico
4.0K	favicon.svg
  0B	images
 20K	img
4.9M	index
4.0K	index.html
348K	js
4.0K	metadata.json
4.0K	theme-settings.json
  0B	videos

Long and short - 200MB might be shocking if you were thinking this was just some simple, flat HTML file generation, but it's not - and that's not an unreasonable size for how DocC is designed today.
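
To check how many per-symbol files an archive actually contains, something like the following rough sketch could be used. The function name count_files and the MyApp.doccarchive path are placeholders for illustration, not part of DocC or Xcode.

```shell
# Count files whose name matches a glob pattern anywhere under a directory.
# $1 = directory to search, $2 = file name pattern (quote it so the shell
# does not expand it before find sees it).
count_files() {
  find "$1" -type f -name "$2" | wc -l | tr -d ' '
}

# Hypothetical usage (MyApp.doccarchive is a placeholder path):
#   count_files MyApp.doccarchive/data '*.json'
#   count_files MyApp.doccarchive/documentation 'index.html'
```

If the two counts come out close to each other, that would be consistent with the one-placeholder-per-JSON-file layout described above.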

This is something we've noticed in the Swift Package Index as well. One package we host docs for generates almost 500MB of docs from 1.1MB of source files...

The cause seems to be that this package has lots of functions, and every function comes with a 10 kB JSON file. This quickly adds up to a massive archive.
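
As a back-of-the-envelope check, taking the roughly 500 MB and 10 kB figures at face value and assuming (unrealistically) that the whole archive consisted of such JSON files, the upper bound on the file count can be sketched like this. The helper name max_files is made up for the example.

```shell
# Upper bound on how many files of a given size fit in an archive.
# $1 = archive size in MB, $2 = per-file size in kB.
# Uses decimal units (1 MB = 1000 kB) and integer division.
max_files() {
  echo $(( $1 * 1000 / $2 ))
}

max_files 500 10   # prints 50000
```

The real number of symbols will be lower, since the archive also holds the index.html placeholders, the index, and other assets.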

I meant to bring this up in the Doc WG at some point to see if there's something that could be done to trim this back a bit perhaps. Given that we're going to be keeping historic versions, the impact could be significant!

For projects that contain structs that conform to SwiftUI.View, I expect the large size of documentation archives to be due to the fact that each struct gets a copy of the protocol extensions (mainly view modifiers) of SwiftUI.View, which I believe number in the hundreds, due to how SwiftUI is designed.

For example, see the symbols in PhotoCamera vs. DocumentCamera. Each of these APIs gets its own individual page, because from the language perspective, these are distinct symbols (synthesized by the compiler).

In the projects mentioned above, @hamed8080 @Joseph_Heck @finestructure I wonder what the size reduction is if you delete all the view-implementations folders in the DocC archive's data/ folder.
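
One way to estimate that without deleting anything is to add up the disk usage of every view-implementations folder. This is a minimal sketch: dir_total_kb is a made-up helper name and MyApp.doccarchive is a placeholder path.

```shell
# Sum the on-disk size, in 1024-byte units as reported by du -sk, of every
# directory with the given name under a root directory.
# $1 = root directory, $2 = directory name to look for.
dir_total_kb() {
  # -prune stops find from descending into a match, so nested matches
  # are not counted twice.
  find "$1" -type d -name "$2" -prune -exec du -sk {} + |
    awk '{ total += $1 } END { print total + 0 }'
}

# Hypothetical usage (MyApp.doccarchive is a placeholder path):
#   dir_total_kb MyApp.doccarchive/data view-implementations
#   du -sk MyApp.doccarchive/data
```

Comparing the first number with the du -sk total for data/ gives a rough idea of the share taken up by synthesized view modifier pages.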

While the documentation here is indeed correct because these symbols do exist, I'd be in favor of a way to configure DocC to not generate pages for synthesized symbols because their documentation is accessible on Apple's documentation website. Ideally, as part of this, you'd still be able to browse the synthesized view modifier symbols of the page by making the SwiftUI.View text in the "Conforms To" section of the page link to Apple's SwiftUI documentation, but that can be left as future work.

Thoughts?

That's a perfect explanation of why an archive can be so large, and I feel exactly the same way: SwiftUI modifiers produce a new copy for every struct, which results in the sizes we've seen.

But given that I want to push my application to GitHub and use GitHub Pages, I'll run into trouble with the size limits on the GitHub account.

Thanks, I learn something new every day. I've never seen the du command before. It's much more important than what I asked for here. I used to work with some applications to find out the size of directories.

You can use du -hs * | sort -h for "human-readable" sorting (by SI suffix).

I definitely agree we should explore ways to reduce the size-on-disk of archives. But just for context in this discussion, GitHub Pages currently supports published websites of up to 1GB:

  • Published GitHub Pages sites may be no larger than 1 GB.

I think the issue here is more about DocC's page-per-symbol generation model than it is about the particular format it currently emits. Each symbol in the project is going to produce a page, each page is going to have a JSON document, and SwiftUI View conformance leads to many more symbols than might be expected.

In your example, the bulk of the size is in the data directory, which is flat JSON file generation (one JSON document per page), very similar to what flat HTML file generation would produce, just in a different format.

The additional 31 MB of duplicated index.html files in the documentation directory for supporting GitHub Pages is definitely something we could work to improve though.

(To be clear, I still see value in supporting HTML files, I just don't think it would inherently lead to size-on-disk improvements.)

in my experience 1 GB of storage goes quicker than you expect. :( for a package with 200 MB of docs a 1 GB limit means OP can only serve four past versions of docs for his package…

Imagine I create documentation with DocC for a large-scale application with tons of methods, classes, and structs. I believe the size would go well beyond the numbers I mentioned here. Thanks for creating such a powerful tool, but there is a really big obstacle here that prevents me from planning to publish on GitHub Pages.

For comparison, Alamofire's generated DocC bundle is 22.2MB, while our Jazzy docs, minus the Dash docs, are 8.4MB.

Even if size wasn't a consideration, I would be +1 on removing these. I see loads of these on every type:

image

And for the most part they are just clutter, IMO. Associated types can be worth mentioning (because that is a way in which different types can differ in their conformance), but other than that, all conformances to a protocol have the same members. The pages for these members are barebones - just "inherited from X". That's all.

image

These things just don't add value.

pruning synthetics is important and swift-biome does this, but it’s not going to make a dramatic difference for OP because it’s not like 99 percent of the symbols in a typical package are synthesized, it’s more like 40 to 50 percent, so at best it will cut the archive size by half.

in theory synthetics can gain unique documentation from a documentation extension, so even if the members are the same the documentation can be different.

some types vend their API through their Collection/Sequence conformances. so removing the protocol members would remove important information about how to use those types.

however i agree a lot of these are just useless. what the hell is a halfWidthCornerQuoted?

On a related note, I really hate that Xcode now shows the "Inherited from..." line rather than the actual documentation, with no link to the actual documentation either. Really all of these inherited or synthesized documentation points should essentially alias the underlying documentation, not duplicate it. Not sure how to accomplish that. But this "Inherited from..." behavior should definitely be reversed.

The Swift compiler already has a -skip-synthesized-members argument which will remove these symbols. It should be possible to pass this flag in the OTHER_SWIFT_FLAGS build setting in Xcode. I filed "Support for skipping synthesized symbols" (issue #27 in apple/swift-docc-plugin on GitHub) to support this via the SwiftPM plugin.

I want the synthesized API to be visible in some way (as well as inherited API), I just don't want it to block the inherited documentation. Ideally it would somehow alias the original docs into the type's documentation, with perhaps a note saying it's inherited. If that can't be done I'd rather the whole documentation was duplicated inline than blocked.

Passing the --enable-inherited-docs flag to DocC duplicates the synthesized symbol's documentation into the archive, but I don't think this should be the default behavior for at least two reasons:

  1. It makes the DocC archive size even larger, because each page now has the symbol's full documentation.
  2. Links in the synthesized symbol's documentation may not resolve when the documentation is duplicated. For example, if the documentation for a symbol synthesized from SwiftUI links to another SwiftUI API via a relative symbol link, e.g., ``View``, that symbol won't resolve when compiling documentation for your framework, because ``View`` isn't defined in your framework.

To me, the right approach here regarding how these synthesized symbols are treated is to not duplicate documentation, and instead have DocC automatically link to the original API the symbol is synthesized from. This would fit in nicely with the work @ronnqvist proposes here: Use cases for combined documentation of multiple targets in Swift-DocC. And until this is implemented, we should allow users to turn off synthesized symbols via a command-line flag to unblock them.

for what it’s worth, Biome can already do this, example: first - SwiftSyntax Documentation

HTTP/1.1 200 OK
host: swiftinit.org
link: <https://swiftinit.org/reference/swift/collection.first>; rel="canonical"
content-length: 4008
content-type: text/html; charset=utf-8