URLs as Swift Package Identifiers

tkrajacic · January 29, 2021, 6:58pm

The problem is (please correct me if I'm wrong) that currently module names need to be globally unique. And this sucks.

Frizlab · January 29, 2021, 8:42pm

It’s not a question of GitHub and GitLab, it’s a question of any provider. GitHub and GitLab are well known, but they are not the only ones in the world. If you hosted your repo on some other site, you have no guarantee whatsoever you’ll still be able to mirror the repo on said site should you be migrating. Also usually if you want to migrate, it’s because you want the code out of the server. Or the server might be down. So what happens in this case?
This point is moot anyway, Core Team review said “no URL identifiers.”

The point of the identifier is to be unique. Once you have chosen it, that’s it. If you want to change it, you’re just effectively creating a new project.
Moving the repo to another host will not however change the identity of the project as long as you keep the same ID in your Package.swift (or wherever it will be defined).

Not only. With this graph:

A
|-B
|-|-C
|-C
(A depends on B and C; B depends on C)

If project C is moving to another host, with a URL as ID, its ID would change. But now let’s say A knows the project has changed location, but B does not. So how would SPM reconcile this? It cannot and currently does not… and we get module duplicates.
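A minimal sketch of the scenario above, assuming a simplified model in which identity is derived purely from the location. The `DependencyRef` type, the `urlIdentity` rule, and the example hosts are hypothetical and are not SwiftPM's actual implementation.

```swift
// Hypothetical model: a dependency reference as it might appear in a manifest.
struct DependencyRef {
    let name: String   // package name, e.g. "C"
    let url: String    // location the depending package points at
}

// Identity derived from the location: lowercased URL without a trailing ".git".
func urlIdentity(_ ref: DependencyRef) -> String {
    var s = ref.url.lowercased()
    if s.hasSuffix(".git") { s.removeLast(4) }
    return s
}

// Collect the distinct identities produced by a given identity rule.
// More than one identity for the same package means the resolver
// would treat the references as different packages.
func distinctIdentities(_ refs: [DependencyRef], by key: (DependencyRef) -> String) -> Set<String> {
    Set(refs.map(key))
}

let refs = [
    DependencyRef(name: "C", url: "https://newhost.example/acme/C.git"), // A's view of C
    DependencyRef(name: "C", url: "https://oldhost.example/acme/C.git"), // B's view of C
]

let byURL = distinctIdentities(refs, by: urlIdentity)
let byName = distinctIdentities(refs) { $0.name }
print(byURL.count, byName.count) // prints "2 1"
```

Under this model the URL rule yields two identities for what is meant to be one package, while a location-independent key yields one.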

John_McCall · January 29, 2021, 8:42pm

I think people may be skipping a few steps when leaping from the idea that "URLs don't mean identity" to "we need a centralized registry of all packages". Thoughts:

Currently, package identity is some combination of the name of the package and the URL, and both of those are problematic, because package names tend to not be reliably unique, and package URLs don't play well with repository moves and forks.
So we need packages to be able to declare their identity in some more elaborate way. But that doesn't have to mean that people have to use that identity to actually look up and check out packages, and therefore that you need registries of identity to URL. You could still use repository URLs for dependencies in package manifests as long as you can determine retroactively whether two packages are supposed to be the same. For example, maybe the package manifest says that it's jenny.appleseed.LinkedList instead of just LinkedList.
Ideally, something about the package identity discourages spurious conflicts by its nature. You could use reverse DNS, a large random number, whatever. In principle, it should be fine to use a less-unique identifier like your name for private packages, like the jenny.appleseed in my example, but we'd want to strongly discourage that in published packages. I suspect that the combination of good documentation/tutorials, social pressure, and some kind of package linter would probably be sufficient here. (We could even automatically run that linter on all the packages we find on a host like github.)
swiftpm needs to use that extended identity to recognize that it's got multiple copies of the same package checked out from different places. This of course introduces the possibility of a conflict when the repository is checked out from multiple places. swiftpm would probably need some way to automatically resolve trivial conflicts by recognizing that one repository is a superset of the other, or some similar rule.
We need some way to actually resolve those conflicts without having to manually edit the manifests of all your dependencies. Basically, you should be able to tell swiftpm to ignore the URLs in the package manifests and actually pull jenny.appleseed.LinkedList from your private fork. Maybe that's as simple as always preferring the URLs from downstream manifests.
We need something about the package identity to propagate to Swift module names in order to resolve the problem that, currently, module names have to be globally unique. For example, multiple packages should be able to provide a Utilities module, and that module should just be namespaced to that package. As long as this identity is consistent for all the modules in the package, it makes sense that imports would prefer modules within the current package; so import Utilities would reliably find the package's private Utilities module and not conflict with the Utilities module from some other package that happens to have already been built.
It would be very nice if package identities were naturally identifiers (possibly compound identifiers with dots between them) in Swift source code, so that you could use them to e.g. qualify an import statement. Of course we could invent arbitrary syntax for this and allow string literals, but there are things that feel right in source code and things that don't.
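One way to read the points above about declared identity and downstream overrides is sketched here. It assumes each manifest entry carries a declared identity, a URL, and its distance from the root package. The `Requirement` type, the `depth` field, and the example URLs are hypothetical; the rule shown is only the "prefer the URLs from downstream manifests" idea, not an existing SwiftPM feature.

```swift
// Hypothetical model: what one manifest says about one dependency.
struct Requirement {
    let identity: String   // declared identity, e.g. "jenny.appleseed.LinkedList"
    let url: String        // where this manifest says to fetch the package
    let depth: Int         // 0 = root package, larger = further upstream in the graph
}

// For each declared identity, keep the URL from the manifest closest to the root.
func resolveLocations(_ requirements: [Requirement]) -> [String: String] {
    var chosen: [String: Requirement] = [:]
    for req in requirements {
        // Skip this entry if an equally or more downstream manifest already decided.
        if let current = chosen[req.identity], current.depth <= req.depth { continue }
        chosen[req.identity] = req
    }
    return chosen.mapValues { $0.url }
}

let locations = resolveLocations([
    Requirement(identity: "jenny.appleseed.LinkedList",
                url: "https://example.com/jenny/LinkedList.git", depth: 2),
    Requirement(identity: "jenny.appleseed.LinkedList",
                url: "https://example.com/me/LinkedList-fork.git", depth: 0),
])
```

Here the root package's private fork wins without editing the manifest of the dependency that points at the original location.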

Cavelle_Benjamin · January 29, 2021, 8:44pm

Keep It Simple

Have the Unique identifier as the hash of the Owner Email, Package Name, git url, and Tag (Semantic Version)

Every listing is “owner/package-name” or “owner.package-name/[library|executable]”, similar to Homebrew.

In the Package.swift file, define a default registry but then allow each dependency definition to list a specific registry.

In swift script files with the @package(), do the same thing.
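A rough sketch of the hash-based identifier suggested in this post. The post does not name a hash function, so FNV-1a is used here purely as a self-contained stand-in; a real scheme would presumably use a cryptographic hash. The function names and example values are hypothetical.

```swift
// FNV-1a (64-bit): a simple non-cryptographic hash, used only as a stand-in.
func fnv1a64(_ text: String) -> UInt64 {
    var hash: UInt64 = 0xcbf29ce484222325   // FNV offset basis
    for byte in text.utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3        // FNV prime, with wrapping multiply
    }
    return hash
}

// Combine the four fields named in the post into one hex identifier.
func packageID(ownerEmail: String, name: String, gitURL: String, tag: String) -> String {
    let joined = [ownerEmail, name, gitURL, tag].joined(separator: "\n")
    return String(fnv1a64(joined), radix: 16)
}

let original = packageID(ownerEmail: "owner@example.com", name: "LinkedList",
                         gitURL: "https://oldhost.example/LinkedList.git", tag: "1.0.0")
let moved = packageID(ownerEmail: "owner@example.com", name: "LinkedList",
                      gitURL: "https://newhost.example/LinkedList.git", tag: "1.0.0")
```

Because the git URL and the tag are inputs to the hash, an identifier built this way changes when the repository moves or a new version is tagged.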

Max_Desiatov · January 29, 2021, 8:55pm

Can you elaborate on these points? I saw them mentioned a couple of times, but somehow they don't connect for me with anything I've seen out there in the wild when using SwiftPM so far.

What exactly makes URLs unsuitable for denoting the identity? And what would be an example of a spurious conflict in this context then?

Max_Desiatov · January 29, 2021, 9:19pm

This is already the case, this ID is specified as the name argument passed to the Package initializer in package manifests:

import PackageDescription

let package = Package(
    name: "OpenCombine",
    products: []
)

Frizlab:

Not only. With this graph:
A
|-B
|-|-C
|-C
(A depends on B and C; B depends on C)
If project C is moving to another host, with a URL as ID, its ID would change. But now let’s say A knows the project has changed location, but B does not. So how would SPM reconcile this? It cannot and currently does not… and we get module duplicates.

It can and it currently does. Just today I had a dependency tree that depended on both upstream OpenCombine in a root dependency and my fork of OpenCombine through a subdependency. As in your example, C dependency of B was from a fork, but C dependency of A was coming from upstream. SwiftPM correctly resolved the tree through the upstream and used package name as an identity, which was a pleasant surprise to me, and the best behavior I hoped for.

Based on this, I'm still not sure what exact use case is supposed to be solved by proposed solutions with opaque identifiers or reverse DNS. If the problem is that package names aren't globally unique, there's no way around that other than a centralized authority that guarantees such uniqueness. Or if there is one, what is it then?

ahti · January 29, 2021, 9:39pm

If you require absolute all-or-nothing uniqueness, then I'd tend to agree. But I don't think such absolute guarantees are needed here. Taking as an example a reverse-dns identifier (and ignoring for the moment ideas of validating ownership for that somehow), collisions should become much less likely.

Even if I don't own lukasstabe.de, the odds of another developer (non-maliciously) publishing a package with an ID of de.lukasstabe.SQLele seem slim enough to be acceptable for me.

That might not be good enough for packages to be included in some registry that holds itself to higher standards (in particular, I expect, to combat malicious names like typo-squatting etc), but that imo should be up to the registry, and I'm not sure I like the idea that to publish a package I'd need to either pay for a domain or entrust ownership of my package's identity to some third party.

John_McCall · January 30, 2021, 12:03am

I've been informed that the repository-based identity is unreliable and is causing problems; I trust the people who tell me that. For something as important as identity, reliability is key. Do we know what rule swiftpm is using to avoid duplication in your example? Is it actually just the package name?

mmarston · January 30, 2021, 12:03am

This is a good clarification of what the Swift core team is thinking. Previously I (and likely others in this thread) took the following statement from @tomerd as saying that the registry protocol needed to be amended so that requests to the registry use opaque identifiers to look up a package instead of using package URL:

But now, if I understand the most recent comment from @John_McCall correctly, the core team's concerns may be addressed by having each Package.swift manifest declare a non-location-based identifier that can be used to address deduplication and other concerns, while package manifests continue to reference dependencies using URLs and the registry API can continue to use package URL to look up packages.

John_McCall · January 30, 2021, 12:42am

To be clear, I was just speaking on my own behalf there; I didn’t run that past the rest of the core team. But yes, I think our basic feedback is just that we think it’s important for package identity to be independent of hosting, and we think there are interrelated problems of package identity and module identity that it’s important to solve together. We’re not trying to insist on any particular approach.

Cavelle_Benjamin · January 30, 2021, 10:31am

If there is a default registry or a registry can be supplied per package, why not have the identifier be "owner/package:version"? Similar to what https://github.com/yonaskolb/Mint does.

If version is omitted, you assume the latest semantic version available. If you need to clip for patch or minor versions use modifiers like https://github.com/mxcl/swift-sh does.

This would keep the identifier mostly agnostic of the repo (assuming a default registry), but allow assigning a registry per package.
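A small sketch of how an "owner/package:version" string could be taken apart, assuming the version part is optional and its absence means "latest". The `PackageSpec` type and `parsePackageSpec` function are hypothetical and do not reflect how Mint or swift-sh parse their arguments.

```swift
struct PackageSpec: Equatable {
    let owner: String
    let name: String
    let version: String?   // nil means "use the latest available version"
}

// Parse "owner/name" or "owner/name:version"; return nil for anything else.
func parsePackageSpec(_ text: String) -> PackageSpec? {
    let parts = text.split(separator: "/", maxSplits: 1, omittingEmptySubsequences: false)
    guard parts.count == 2, !parts[0].isEmpty else { return nil }

    let tail = parts[1].split(separator: ":", maxSplits: 1, omittingEmptySubsequences: false)
    guard !tail[0].isEmpty else { return nil }

    let version: String? = tail.count == 2 ? String(tail[1]) : nil
    if let v = version, v.isEmpty { return nil }   // reject a trailing ":"

    return PackageSpec(owner: String(parts[0]), name: String(tail[0]), version: version)
}

let pinned = parsePackageSpec("yonaskolb/Mint:1.2.3")   // version number is illustrative
let latest = parsePackageSpec("yonaskolb/Mint")
```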

Max_Desiatov · January 30, 2021, 11:15am

What purpose does reversing serve here? Why not specify it as Package(domain: "lukasstabe.de", name: "SQLele") instead? Or, as a shorthand, Package(id: "lukasstabe.de/SQLele")?

What I'm trying to say here, what's the benefit of reinventing URLs that use dots instead of slashes in non-domain parts and for some reason require writing everything backwards? Why not use URLs as they already exist?
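For reference, the two spellings carry the same information when the domain and the name are known separately. The helper below is a hypothetical illustration of the forward conversion only; going back from the dotted form is ambiguous without knowing where the domain ends.

```swift
// Build a reverse-DNS style identifier from a domain and a package name.
func reverseDNSIdentifier(domain: String, name: String) -> String {
    let reversed = domain.split(separator: ".").reversed().joined(separator: ".")
    return "\(reversed).\(name)"
}

let dotted = reverseDNSIdentifier(domain: "lukasstabe.de", name: "SQLele")
print(dotted) // prints "de.lukasstabe.SQLele"
```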

Max_Desiatov · January 30, 2021, 11:25am

This non-location-based identifier is already required to be passed as a name argument to the Package initializer. What makes it unsuitable for deduplication as is?

michelf · January 30, 2021, 12:11pm

There has not been that much discussion about this part: the idea that if the package identifier can make its way into the language it'll allow disambiguating modules with the same base name.

With a reverse-DNS identifier, we end up making fully qualified names look somewhat like this:

import de.lukasstabe.Utilities

This follows the usual namespacing order: most general to more specific. It will also feel familiar to many people (Java packages and Apple's bundle identifiers), so I guess that's a small bonus. And then you can use it somewhat like this for fully-qualified names:

let view = de.lukasstabe.Utilities.CanvasView()

But while this is nice, it's not necessarily a requirement. The in-language syntax could be different from the in-package one. Or the in-language syntax could use a quoted string or some other separator:

let view = `lukasstabe.de`.Utilities.CanvasView()
let view = lukasstabe.de::Utilities.CanvasView()

How package identifiers integrate into the language hasn't been discussed much yet. But the core team said they want to solve the clashing module name situation with package identifiers (or something like that), so we need a discussion about this. It seems to me this needs to be solved before a registry can take shape, as we'll most likely want the package identifier to be similar in-language and in-package.
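The lookup rule implied by package-qualified modules can be sketched as a table keyed by package identifier and module name. This is a hypothetical model, not compiler behavior: `ModuleKey`, `ModuleTable`, and the `com.example` package are made up for illustration. An unqualified import prefers the importing package's own module and otherwise succeeds only when the name is unambiguous.

```swift
struct ModuleKey: Hashable {
    let packageID: String   // e.g. "de.lukasstabe"
    let module: String      // e.g. "Utilities"
}

struct ModuleTable {
    var known: Set<ModuleKey> = []

    mutating func register(packageID: String, module: String) {
        known.insert(ModuleKey(packageID: packageID, module: module))
    }

    // Resolve an unqualified `import <module>` written inside `packageID`.
    func resolve(_ module: String, importedFrom packageID: String) -> ModuleKey? {
        let local = ModuleKey(packageID: packageID, module: module)
        if known.contains(local) { return local }          // the package's own module wins
        let candidates = known.filter { $0.module == module }
        return candidates.count == 1 ? candidates.first : nil  // nil when missing or ambiguous
    }
}

var table = ModuleTable()
table.register(packageID: "de.lukasstabe", module: "Utilities")
table.register(packageID: "com.example", module: "Utilities")

let own = table.resolve("Utilities", importedFrom: "de.lukasstabe")
let ambiguous = table.resolve("Utilities", importedFrom: "org.thirdparty")
```

In the ambiguous case a qualified spelling, whatever syntax is chosen for it, would be needed to pick one of the two modules.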

finestructure · January 30, 2021, 12:35pm

Personally, I'd prefer if my package at https://github.com/finestructure/Arena was identified as finestructure/Arena or co.finestructure/Arena, simply because all elements in the name are mine.

I'm not even sure there exists a "self-owned" URL I'd want for the package name that'd also be neither weird nor invalid. I.e. https://finestructure.co/Arena isn't valid unless I host something there (what even?) and if I used https://finestructure.co/swift-packages/Arena I'd still not know what exactly should be there nor is it a great package name.

I hope I'm making sense and haven't lost track of what the actual ramifications are. Feels like we're really deep in the weeds here.

Frizlab · January 30, 2021, 1:09pm

If I understand correctly, currently the names are “simple” names (not namespaced), and collisions are frequent (e.g. two Utilities packages).

Dante-Broggi · January 30, 2021, 1:48pm

So IIUC:
URLs are not good because they are too volatile due to location changes.
Hashes are not good because they are hard for people to remember.
A global namespace is not good because we want to be decentralized.
Reverse-DNS is URLs written funny.

Therefore:
How about using the tag URI scheme?

They can be constructed from either an HTTP URL or an email address, along with a date, and are thenceforth time-independent.
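A minimal sketch of building such an identifier, following the general shape `tag:<authority>,<date>:<specific>` defined by the tag URI scheme (RFC 4151), where the authority is a domain name or email address and the date is `YYYY`, `YYYY-MM`, or `YYYY-MM-DD`. The helper names are hypothetical, and the check only covers the date's shape, not every rule of the scheme.

```swift
// Accept "YYYY", "YYYY-MM", or "YYYY-MM-DD" made of ASCII digits.
func isTagDate(_ date: String) -> Bool {
    let parts = date.split(separator: "-", omittingEmptySubsequences: false)
    let expectedLengths = [4, 2, 2]
    guard (1...3).contains(parts.count) else { return false }
    return zip(parts, expectedLengths).allSatisfy { part, length in
        part.count == length && part.allSatisfy { $0.isASCII && $0.isNumber }
    }
}

// Assemble a tag URI, or return nil when the date has the wrong shape.
func tagURI(authority: String, date: String, specific: String) -> String? {
    guard !authority.isEmpty, isTagDate(date) else { return nil }
    return "tag:\(authority),\(date):\(specific)"
}

let tagged = tagURI(authority: "lukasstabe.de", date: "2021-01-30", specific: "SQLele")
```

The date records when the authority was held by whoever minted the tag, which is what keeps the identifier stable if the domain later changes hands.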

Max_Desiatov · January 30, 2021, 5:21pm

I don't think that reverse DNS (or any opaque identifiers for that matter) will solve this problem (again, unless you create a centralized authority that checks for name uniqueness). What would prevent multiple people from creating a com.utilities.Utilities package? Some number of posts above in this thread I already suggested a scenario, where a tutorial describes the creation of a com.test.Example package. Enough people following the tutorial would then copy-paste it and not bother changing the identifier.

Simply hoping that longer identifiers will make collisions less frequent is not going to solve the actual issue, even if we consider it a serious one. URLs on the other hand don't have such a problem, and their uniqueness is guaranteed by the existing DNS system and actual servers that serve queried paths. In addition, this system is distributed and resilient enough, and wouldn't require SwiftPM users to change much if anything in their workflows.

Max_Desiatov · January 30, 2021, 5:30pm

Isn't any other system (i.e. registries, or a centralized package identifier resolution system) just as volatile due to the same reasons? Why is it reasonable to expect that a given package can move, but a registry can't?

You could say that a registry has a strong incentive to have its URL fixed, but what makes this different from the same incentive for a package URL to be fixed? If I know that enough people rely on my package, I'm going to set up an HTTP redirect or mirror the package as a whole Git repository to allow my users to continue using it. Shifting the responsibility from package maintainers to registries isn't going to make this any less volatile, but probably less reliable due to newly introduced complexity.

ahti · January 30, 2021, 10:20pm

To me at least, "I'm gonna publish a package people might like" sounds like much less of a commitment and potential for future headache than "I'm gonna set up a registry that others will use to find the packages they want". And I expect people choosing to do the latter will spend some more thought on what it would mean to keep that running and available in the future.

But that won't be possible in all cases. Imagine a package published by devs at Somecorp, located, for whatever reason, at somecorp.com/AwesomePackage. Now Somecorp goes belly-up, gets renamed, merged into another corp, lays off their Swift team, whatever life may have in store. Do you expect some well-meaning developer who wants to step up to continue maintaining AwesomePackage to have both the connections and/or the (quite possibly substantial) spare change to get a hold of somecorp.com, just to set up a redirect?

Failing that, the package becomes unavailable, and in any package's dependency graph, all mentions of AwesomePackage will need to be changed at once, because pending the—in this example impossible to set up—redirect, newhost.com/AwesomePackage is by definition a different package than somecorp.com/AwesomePackage.

In a system where the package is identified by just the string (no proof of ownership) com.somecorp.AwesomePackage (or any other string, although I personally like how reverse-dns should reduce collisions quite naturally), the initial situation could be much alike: The devs published the package in Somecorp's registry at somecorp.com, which is now gone, so the package is unavailable.

But in this system, a mechanism for the user (or the package that sits at the top of the dependency tree) to specify where SPM should look for packages fits in nicely. Now the user can say something like "also look for packages you can't find elsewhere in newhost.com", or maybe just "get the specific package com.somecorp.AwesomePackage from newhost.com", which would take precedence over what is specified in dependencies' manifests.
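The two user-level instructions described above can be sketched as a lookup order: a per-package override first, then the location from the manifest, then any fallback hosts. Everything here is hypothetical, including `ResolverConfig`, the `locate` function, the way a fallback URL is formed, and the injected `isReachable` check that stands in for real network access.

```swift
struct ResolverConfig {
    var overrides: [String: String] = [:]   // identity -> location chosen by the end user
    var fallbackHosts: [String] = []        // "also look here if nothing else works"
}

func locate(
    identity: String,
    manifestURL: String,
    config: ResolverConfig,
    isReachable: (String) -> Bool
) -> String? {
    // 1. An explicit user override takes precedence over any manifest.
    if let pinned = config.overrides[identity] { return pinned }
    // 2. Otherwise use the location the dependency's manifest names, if it still works.
    if isReachable(manifestURL) { return manifestURL }
    // 3. Finally, try each fallback host in order.
    for host in config.fallbackHosts {
        let candidate = "\(host)/\(identity)"
        if isReachable(candidate) { return candidate }
    }
    return nil
}

// somecorp.com is gone; only the new host answers.
let reachable: Set<String> = ["https://newhost.com/com.somecorp.AwesomePackage"]
let config = ResolverConfig(fallbackHosts: ["https://newhost.com"])
let found = locate(identity: "com.somecorp.AwesomePackage",
                   manifestURL: "https://somecorp.com/AwesomePackage",
                   config: config,
                   isReachable: { reachable.contains($0) })
```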

This empowers the end-user to quickly take action and fix their build without forking a potentially huge number of packages, and—maybe, possibly, depending on how exactly this feature works, I'm not so sure here—might reduce the need for changing package manifests around across the ecosystem.

Now, one might say that a similar system could be devised while still using URLs for identity, but at that point the domain part would be kind of a false promise, no longer necessarily pointing where the package (or some hint at the package's actual location) can be found.

It would also be weird that either somecorp.com/AwesomePackage would need to remain the false-promise-identity forever (meaning every user would need to set up that "please look in this other location"-mechanism or have it set up for them by some automated mechanism), or SwiftPM (and every other package-processing tool where this is relevant) would need to somehow learn to be taught "no, even though the URL is by definition a package's identity, somecorp.com/AwesomePackage and newhost.com/AwesomePackage, even though their URLs differ, are actually the same package after all, please treat them that way."

To me the small promise of possibly verifying some sort of identity or ownership (well, in the moment at least) just doesn't seem like it'd be worth the inconsistencies, pitfalls and workarounds that would imo come with choosing to bind location and identity together.