LLVM monorepo transition

The LLVM project is moving to a “monorepo” at github.com/llvm/llvm-project (more background here). llvm, clang, clang-tools-extra, compiler-rt, and libcxx will be in the same Git repository. It's scheduled to become the canonical repository, replacing Subversion, at the next LLVM developers' meeting in October or November of 2019.

How does the monorepo transition impact Swift?

The Swift compiler builds against LLVM project sources hosted at github.com/apple, with histories in swift-clang, swift-llvm, swift-compiler-rt, swift-clang-tools-extra, and swift-libcxx based on git-svn mirrors of the Subversion repository.

  • These sources need to be rebased on top of the canonical LLVM project monorepo.
  • The Swift compiler and open source toolchain need to build against this new repository.

We're working on it.

Duncan (dexonsmith) and I (Alex_L) are working with Mishal (mishal_shah) on a full transition plan for the impacted repos on github.com/apple.

The high-level goal is to rebase swift-llvm, swift-clang, swift-clang-tools-extra, swift-compiler-rt, and swift-libcxx, merging their histories into a new repository downstream of the LLVM monorepo, and to change swift's update-checkout script to point at this new repository. The old repositories will be archived with their histories intact.

We'll follow up with more details in a few weeks.

Do you mean "rebase" in the git sense of the word? Or something else?

As with many SCM problems and git, there tend to be multiple solutions with different tradeoffs. What git approaches were ruled out and why?

EDIT – One more question: given that LLVM went with a "monorepo", was a downstream/derivative "Swift monorepo" considered? (Again, like many SCM problems, this is just a tradeoff with pros and cons.)

We will not be doing a "git rebase". In our "rebase", we plan to reconstruct the commit history by zippering the downstream split histories into one downstream monorepo that preserves the existing split merge history from upstream to downstream, while reparenting those merges on top of the appropriate upstream monorepo commits. This should give us one-to-one mapping from an existing split github.com/apple/swift-{llvm/clang/..} commit to a new monorepo commit, and vice-versa. I think @dexonsmith should be able to give you a more detailed answer about the tradeoffs and the approaches that we are considering.
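To make the "zippering" idea a bit more concrete, here is a toy sketch, not the actual tooling. It assumes each split history is a simple oldest-first list of commits with timestamps, and it ignores the upstream merge reparenting described above entirely. The names `SplitCommit`, `zipper`, and the `mono-NNNN` ids are made up for illustration; the only point is that interleaving the split histories can produce one new commit per old commit, which gives the one-to-one mapping.

```python
import heapq
from typing import Dict, List, NamedTuple, Tuple


class SplitCommit(NamedTuple):
    repo: str       # split repository name, e.g. "swift-llvm"
    old_hash: str   # hash of the commit in the split repository
    timestamp: int  # commit time in seconds (assumed to increase within a repo)


def zipper(
    histories: Dict[str, List[SplitCommit]],
) -> Tuple[List[SplitCommit], Dict[Tuple[str, str], str]]:
    """Interleave per-repo histories (each oldest first) by commit time.

    Returns the interleaved order plus a map from (repo, old hash) to a
    placeholder id standing in for the new monorepo commit.
    """
    # heapq.merge lazily merges inputs that are already sorted by the key.
    merged = list(
        heapq.merge(*histories.values(), key=lambda c: (c.timestamp, c.repo))
    )
    old_to_new = {
        (c.repo, c.old_hash): f"mono-{i:04d}" for i, c in enumerate(merged)
    }
    return merged, old_to_new


if __name__ == "__main__":
    histories = {
        "swift-llvm": [
            SplitCommit("swift-llvm", "a1", 100),
            SplitCommit("swift-llvm", "a2", 300),
        ],
        "swift-clang": [SplitCommit("swift-clang", "b1", 200)],
    }
    order, mapping = zipper(histories)
    for commit in order:
        print(commit.repo, commit.old_hash, "->", mapping[(commit.repo, commit.old_hash)])
```

Because every old commit gets exactly one new id, the mapping can be inverted to go from a monorepo commit back to the split commit it came from.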

The "Swift monorepo" is certainly interesting, but we haven't considered it in our plans, as we think that it should be a separate topic of discussion. This separation of concerns should also allow us to transition to the LLVM monorepo without affecting the majority of Swift developers and their existing workflows.

Interesting. I don't know if you're aware of this: the LLVM monorepo made some opportunistic cleanup relative to the "truth" that was in the SVN repository (for example, removing build results that were accidentally committed). How will the LLVM cleanup be reconciled in this transition? Will any additional cleanup be done as a part of this transition? (Removing no-op commits, parent simplification, etc.)

Separately, will a file be created in the repository that helps people map old git hashes to new git hashes? For example, if somebody mentions a hash in a bug database, it would be nice if they could just grep a file to get the post-monorepo hash.

I was really hoping that the patchset was going to be rebased. Given that history must be rewritten for this to function, there is a benefit of the rebase - it will make it obvious what the patchset currently looks like and would be easier to try to integrate the changes into the upstream repository. I realize that the opportunistic benefit here would come at a great cost - most of the patches have been smeared across years of development. This means that rebasing the patches is not particularly straightforward (which I believe does increase the difficulty of someone else trying to merge the changes into upstream).

Separately, will a file be created in the repository that helps people map old git hashes to new git hashes? For example, if somebody mentions a hash in a bug database, it would be nice if they could just grep a file to get the post-monorepo hash.

That's a good point, a file is a great idea. There are also other strategies to help this issue: like annotating the commit message with the pre-monorepo hash. This is an approach that would take advantage of the fact that we'd be re-writing history already and would help when navigating history.

Yes, we are certainly aware of cleanups made in the upstream monorepo. The upstream cleanups will be propagated to the new downstream monorepo. I believe @dexonsmith was looking into whether we could perform additional cleanups as well, so he might be able to provide more insight as to what the options are like there.

At the moment we're planning to store the mapping from the old git hashes to new ones using the idea that @kocsenc suggested, i.e. by annotating the commit messages. We are also planning to provide a tool that will allow you to perform this conversion easily, without the need to dig it out of the commit history manually.
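As a rough illustration of how such a conversion could work, here is a minimal sketch. It assumes each rewritten commit message carries a trailer line naming the old hash. The trailer name `Original-Split-Commit` and the helper names are hypothetical; the thread does not specify the real annotation format or the real tool. The input is a sequence of (new hash, commit message) pairs, which one could obtain from the commit log.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical trailer; the real annotation format is not given in this thread.
TRAILER = re.compile(
    r"^Original-Split-Commit:\s*([0-9a-f]{7,40})\s*$", re.MULTILINE
)


def build_hash_map(commits: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map old split-repo hashes to new monorepo hashes.

    `commits` yields (new_hash, commit_message) pairs.
    """
    old_to_new: Dict[str, str] = {}
    for new_hash, message in commits:
        for old_hash in TRAILER.findall(message):
            old_to_new[old_hash] = new_hash
    return old_to_new


def lookup(old_to_new: Dict[str, str], old_hash: str) -> Optional[str]:
    """Resolve a possibly abbreviated old hash.

    Returns None when the prefix matches nothing or is ambiguous.
    """
    matches = {new for old, new in old_to_new.items() if old.startswith(old_hash)}
    return matches.pop() if len(matches) == 1 else None
```

With a map like this, a hash pasted into an old bug report can be resolved without reading the history by hand, which is the convenience the grep-able file was meant to provide.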

I know I implied this question earlier, but please let me be explicit: why was history rewriting/cleanup chosen over, say, git subtree merges which don’t rewrite history or invalidate existing hashes? (I can see good arguments either way.)

Hi @Alex_L – Ping. If you have the time, can you answer my question? History rewriting, for better and for worse, forces private repositories to change somehow. I also assume/hope that Apple has bought into history rewriting, because they have a lot of behind-the-scenes repositories for various reasons.

Hi @DaveZ, sorry for the delayed reply. The simple answer is this: we decided to go with history rewriting because we wanted to create a true downstream of the new llvm.org llvm-project monorepo, where the upstream commit hashes would be the same throughout the commit history. A subtree merge would've added the old upstream commits from the split repos, and we would've ended up with a monorepo that had two sources of upstream commits: the old split history from before the monorepo became canonical, and the new upstream commits merged into the downstream monorepo after it became canonical.

It's true that history rewriting forces us to change the repos, but we see it as an opportunity (e.g. to perform cleanups) rather than a problem. Duncan (@dexonsmith) has been working on a set of tools to generate cascading downstream monorepos with rewritten history for a while now, so yes, we certainly did buy into it. I'll be posting an update about our progress today, so please stay tuned ;)

I wouldn't overstate the negative consequences of history rewriting. LLVM could get away with it because the git mirrors of SVN were never considered canonical, even though lots of people treated them that way.

Personally speaking, I've found git history rewriting and cleanup to be fun, but I've also grown skeptical of it over time. Git hashes tend to end up in bug databases, emails, etc., and semi-invalidating them is the equivalent of "broken links" on the web. Sure, if you work at it, you might find the updated URL, but most people don't bother.

Here is a non-history-rewriting alternative: git "grafts". Have you considered it? In short, one would "squash merge" the current derived split repositories into a new repository that is forked from the canonical monorepo. At the same time, import the split repositories into archive/* branches. (Because nothing requires that different branches originate from the same commit.)

This gives users choices:

  1. Live with the flag day commit, and manually chase history as needed.
  2. Live with the flag day commit and let git chase history by grafting the squash commit into a merge commit via .git/info/grafts.
  3. Ignore the old history completely and don't clone/fetch it to save space.
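For option 2, each line of .git/info/grafts names a commit followed by the parents git should pretend it has, all as full 40-character hashes. The sketch below only formats such a line; the helper name `graft_line` is made up for illustration, and the hashes in the example are placeholders, not real commits. Note that newer Git versions deprecate the grafts file in favor of `git replace --graft <commit> <parent>...`, which takes the same information.

```python
import re
from typing import Sequence

FULL_SHA1 = re.compile(r"^[0-9a-f]{40}$")


def graft_line(commit: str, parents: Sequence[str]) -> str:
    """Format one .git/info/grafts line: the commit, then its pretend parents."""
    for sha in (commit, *parents):
        if not FULL_SHA1.match(sha):
            raise ValueError(f"not a full 40-character hex hash: {sha!r}")
    return " ".join([commit, *parents])


if __name__ == "__main__":
    # Placeholder hashes: the flag day squash commit, its real monorepo
    # parent, and the tips of two archive/* branches.
    squash = "1" * 40
    monorepo_parent = "2" * 40
    archive_llvm_tip = "3" * 40
    archive_clang_tip = "4" * 40
    print(graft_line(squash, [monorepo_parent, archive_llvm_tip, archive_clang_tip]))
```

Appending that line to .git/info/grafts makes the squash commit appear locally as a merge of the monorepo parent and the archived split histories, without changing any hashes in the shared repository.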

From our early experiments we found that git didn't perform that well when cloning / fetching history from a remote with disjoint branches (e.g. new flag day commits, and old split archive branches). That's one of the reasons why we decided not to integrate multiple disjoint histories in one repository.

Interesting. I have a repository where the various split swift repositories are regularly "octo subtree merged" into a toolchain directory, and I haven't seen any performance problems. That being said, I never reclone a repository.

Did you repack the repository before doing performance testing? (I.e. git repack -afd --window=250 --depth=250). This will also save bandwidth and time for git novices and badly written CI tools that feel the need to reclone instead of making the local worktree match the result of a reclone.

FYI – If we set disjoint performance aside, subtree merges don't play nice with the default history simplification algorithm that git log/blame/etc. use. One needs to counterintuitively pass --sparse to force the subtree history to be shown. I've been meaning to file a bug on this.