Shallow git clone for CI purposes

We have an SPM package for building the Firebase SDKs for zip distribution, and an internal CI that pulls the project and all of its dependencies on every build.

I did some digging and saw that SPM's git clone intentionally does not do a shallow clone because of the cost of iterative updates.

Do other folks think it would be helpful to have a command-line flag to do a shallow clone during an SPM build? I assume it could be a parameter of the GitRepositoryProvider initializer and used in the fetch(repository:to:) function, but I'd have to try implementing it first to confirm.

I did a trivial test (outside of SPM) of the time and bandwidth required to clone one of our dependencies (swift-protobuf in this case), and the difference seems non-trivial to me (compounding across multiple builds and a growing number of dependencies):

Deep clone:

$ time (git clone https://github.com/apple/swift-protobuf.git deep-clone)
Cloning into 'deep-clone'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 18239 (delta 1), reused 4 (delta 1), pack-reused 18229
Receiving objects: 100% (18239/18239), 17.88 MiB | 4.67 MiB/s, done.
Resolving deltas: 100% (15037/15037), done.

real	0m7.132s
user	0m5.793s
sys	0m0.379s

Shallow clone:

$ time (git clone --depth 1 https://github.com/apple/swift-protobuf.git shallow)
Cloning into 'shallow'...
remote: Enumerating objects: 416, done.
remote: Counting objects: 100% (416/416), done.
remote: Compressing objects: 100% (380/380), done.
remote: Total 416 (delta 151), reused 76 (delta 25), pack-reused 0
Receiving objects: 100% (416/416), 1.01 MiB | 2.78 MiB/s, done.
Resolving deltas: 100% (151/151), done.

real	0m1.298s
user	0m0.207s
sys	0m0.159s

Emphasizing two important parts:

Deep

real 0m7.132s
17.88 MiB

Shallow

real 0m1.298s
1.01 MiB

The reduced clone time and data downloaded provide a nice benefit for one-off builds, and saving bandwidth in the CI system by not pulling data it won't use would be a nice improvement as well.

If this is something that other folks have interest in (or there's no pushback on it), I'm happy to investigate building the feature locally and doing some proper benchmarks.

Thanks!


If you really want peak efficiency, then never re-clone a project (shallow or not). Here is a "git + CI" cheat sheet:

Dedicate a subdirectory to CI clones/worktrees (i.e. one where you don't do any development), then run git pull --ff-only per project, and if you're feeling paranoid, follow it with git clean -fdx per project. Presto! The same result as re-cloning, but as efficient as possible.
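
A minimal sketch of that per-project step, assuming the clone already exists under the dedicated CI directory (paths hypothetical):

# Update an existing clone in place instead of re-cloning.
cd ci-clones/swift-protobuf
# Fast-forward to the latest upstream state; fails loudly if histories diverge.
git pull --ff-only
# Optionally scrub all untracked and ignored files for a pristine tree.
git clean -fdx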

If it actually is faster, it would be nice to improve efficiency without the need for a special flag.

But the comparison you show is just the timing of a single, isolated Git command. Unless every package in the dependency tree uses only .exact versions, the package manager must still check out manifests from multiple revisions in order to resolve the tree.

Maybe when fetching according to a valid Package.resolved, the clones could reasonably be shallow, and the deep fetching delayed (git fetch --unshallow) until the root manifest changes and it needs to resolve anew? That might theoretically speed cloning up for simple checkout and build operations.
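
In plain Git terms, that idea might look roughly like the following sketch; the tag is illustrative, standing in for whatever revision Package.resolved pins:

# Fast path: pins are valid, so fetch only the pinned tag's tip.
git clone --depth 1 --branch 1.0.0 https://github.com/apple/swift-protobuf.git
cd swift-protobuf
# Slow path: the root manifest changed and full resolution is needed,
# so deepen the existing clone rather than re-cloning from scratch.
git fetch --unshallow --tags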

I suspect it just needs someone to do the empirical testing to find out what actually is fastest over a range of sizes for the repository, version list and dependency tree. If a new strategy can be proven to be generally better, I doubt anyone would object to switching.

(Any such testing should probably use the master branch of the package manager, since a lot of work has been done on the resolution logic since the latest releases branched.)

Unfortunately I don't think that will work for our setup - it's essentially an internal Jenkins instance (called Kokoro, more information here), and there is no dedicated machine for our builds; we get a random machine from a pool. Did you have something else in mind that would work in this situation?

Based on the comment in the code, it seems like this was tested originally and, for general-purpose development, doing a full clone up front is more efficient during iteration, but maybe some more testing is needed.

That is a great point that I certainly hadn't thought of! :slight_smile: In our case we're only using .exact, but the point about having to resolve versions stands. Perhaps once a valid Package.resolved is available it could be used like you mentioned.

Thanks for the replies folks!

That's a good idea if implemented well. I am also fine with starting with a flag to always perform shallow clones, which should be much easier to implement.

I think this is something everyone should get by default so they don't have to know about the flag, assuming we don't know of any downsides (unless you just meant keeping the flag for a short migration window before this becomes the default?).

I meant it seems reasonable to add a flag so users can opt into shallow clones where the clones are short-lived (like the CI use case), until we can implement some sort of feature to automatically manage shallow vs. full clones.


Has there been any more thought about providing different git clone options?

We've resorted to copying some of our dependencies into GitHub repos with the git metadata removed, and are seeing substantial speedups.

For example, this report shows an improvement from more than 10 minutes down to 170 seconds: adding firebase hangs when fetching grpc · Issue #6606 · firebase/firebase-ios-sdk · GitHub
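
A rough sketch of that snapshotting approach, with the mirror URL and tag hypothetical:

# Take a history-free snapshot of the dependency...
git clone --depth 1 https://github.com/grpc/grpc.git grpc-snapshot
cd grpc-snapshot
rm -rf .git
# ...and publish it as a fresh single-commit repository, with one tag
# for SwiftPM to resolve against.
git init
git add .
git commit -m "Snapshot of grpc"
git tag 1.0.0
git remote add origin https://github.com/example-org/grpc-snapshot.git
git push origin HEAD --tags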

The Swift Package Registry service will help with this, but being able to shallow clone based on the Package.resolved file would still be helpful in the meantime.


I'll just add a relevant tidbit here: the Homebrew team was contacted directly by the GitHub team, who asked them to stop using shallow clones because computing those was so taxing server-side: Why is a full clone of homebrew/core required for brew update? · Discussion #226 · Homebrew/discussions · GitHub

As DaveZ already pointed out, setting up your CI to retain its repo from build to build is more efficient anyway.

I did some digging and saw that the SPM git clone intentionally does not do a shallow clone due to the cost of iterative updates.

Please forgive me, but what is an “iterative update”? :sweat_smile:

A deep clone acquires more than necessary, but is a single operation.

An iterative strategy would be a shallow clone followed by individual requests for 1.0.0, then 1.0.1, then 1.0.2, and so on. This acquires only what is necessary, but it is more work, both in network activity and in computation on each end. That strategy was avoided because, when SwiftPM needs to read from many different versions to resolve a graph, iteratively requesting individual pieces of information is usually slower.
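
In plain Git terms, the iterative strategy would look roughly like this sketch (tags illustrative; this is the shape of the strategy, not SwiftPM's actual code):

# Start from a minimal shallow clone...
git clone --depth 1 https://github.com/apple/swift-protobuf.git
cd swift-protobuf
# ...then request each version individually as the resolver needs it.
git fetch --depth 1 origin tag 1.0.0
git fetch --depth 1 origin tag 1.0.1
git fetch --depth 1 origin tag 1.0.2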

But when you are reasonably certain the first request will be all you need (such as when you already have pins), then a narrow shallow request may be faster, and this is an opportunity for optimization. (I say “may” because Git's own internal optimizations might still outweigh the effect; knowing for sure would require testing.)


to be fair, this is a problem that SPM creates for itself: it only needs to check out all the tags in order to read a single file, Package.swift, from each of those snapshots.

i am not a git expert, but i can imagine a few ways it could load just the manifest file without checking out the entire snapshot, for example by fetching Package.swift individually from GitHub:

https://raw.githubusercontent.com/apple/swift-nio/2.58.0/Package.swift
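
a single HTTPS request would be enough in that case, something like:

$ curl -fsSL https://raw.githubusercontent.com/apple/swift-nio/2.58.0/Package.swift

(this relies on GitHub's raw endpoint rather than anything in the Git protocol itself.)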

Yes, I know. SwiftPM's lack of obvious optimizations in this area is embarrassing. I am only describing the thought process from long ago that chose the deep clone as the unoptimized, works-everywhere default.

The focus changed along with team membership and lots of things the earlier team intended to do seem to have been abandoned. These optimizations are one such area.


It's not just a single file, though; it can also be a version-specific manifest (e.g. Package@swift-5.9.swift).

Also, GitHub and Git are of course not the same thing. AFAIK we investigated this in the past, and there's no Git way to get individual files.

I think a registry of sorts is ultimately the "correct" solution, nearly every package manager out there has one.


i want to quote @Jon_Shier on the similar thread because i think it is a reasonable assessment based on the amount of movement we’ve seen in the past few years:

of course, i would love to be proven wrong, and it wouldn’t be the first time either. (variadic generics, finally!) but unlike the story of variadic generics, where many people insisted for a long time that it was important to them and they fully intended to get around to it someday, and what do you know, they actually did get around to it today… we’ve never really gotten any communication along the lines of “a package registry is important to us and we are going to build one eventually.”

instead we only hear “a package registry would solve a lot of problems (that people currently blame on SPM)” and that it would be really great if we had one.

in my personal opinion, the likelihood of a package registry emerging from the community is very low. the problem is not engineering (@daveverwer and @finestructure have demonstrated that it is possible to index swift packages on an ecosystem-wide scale), the problem is that a package registry is simply not a viable business.

a package registry is a fundamentally money-losing enterprise, like vaccine development. registries store, index, and serve large amounts of data to client-side tooling (as opposed to displaying it to users directly), which means there is no opportunity to build pagerank, serve ads, or promote other services. in fact, ideally, the end user does not even know the registry exists, as SPM should abstract this detail away. and that is why we only see swift package registries today in paid ecosystems (e.g. artifactory).

it might help to do some case study into the business model of package registries in other languages.

  1. crates.io (rust) is socialized. it is exclusively operated by their central language org (the Rust Foundation), which pays for its upkeep.

  2. CocoaPods (swift/objc) is a consortium project. several large corporations pooled resources to build infrastructure they are the primary beneficiaries of.

  3. RubyGems (ruby) is a public-private partnership. their central language org foots part of the bill and several large corporations that use ruby form the rest of the consortium. so it is essentially like CocoaPods except their government also behaves like an additional corporate sponsor.

  4. PyPI (python) started out like crates.io and was gradually privatized, and today it is structured a lot like RubyGems.

  5. NPM (javascript) is a gestalt project exclusively operated by Microsoft. the company is large enough that they can rationalize a return on their investment as “paying themselves”, both literally (it runs on their cloud platform) and figuratively (cultivating developers, “growing the pie”, blah blah blah).

i am not aware of any examples of a self-sustaining package registry.
