URLs as Swift Package Identifiers

mattt · January 15, 2021, 6:16pm

Thanks for sharing this, @yim_lee . Responding to your points:

Our proposal doesn't bind identity with location. It starts with that assumption, since that's the case most of the time, but provides several mechanisms for the user, package maintainer, and registry to decouple that relationship.

A user can specify a mirror URL for a package URL
A user can specify custom registry proxy, which has direct control over how package URLs are located
A package maintainer can specify a canonical location for the package
A registry may redirect a URL to a different location

Our proposed scheme doesn't create vendor lock-in. A package maintainer can specify a rel="canonical" link in the response for GET /{package}. Swift Package Manager could use this information for deduplication and communicate that to the user (e.g. '"github.com/apple/swift-nio" has moved to "swift.org/server/swift-nio"')
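To make the canonical-link idea concrete, here is a minimal sketch of how a client might pull a rel="canonical" target out of an HTTP Link header. The function name canonicalLocation(fromLinkHeader:) is hypothetical and not part of Swift Package Manager or the proposal, and the parser assumes targets contain no commas or semicolons.

```swift
import Foundation

/// Returns the target of the first `rel="canonical"` entry in an HTTP `Link`
/// header value, or nil if no such entry exists.
/// Hypothetical helper for illustration; not an API from the proposal.
func canonicalLocation(fromLinkHeader header: String) -> String? {
    // Entries look like: <url>; rel="canonical", <url2>; rel="alternate"
    for entry in header.split(separator: ",") {
        let parts = entry.split(separator: ";").map {
            $0.trimmingCharacters(in: .whitespaces)
        }
        // The first part must be the <target> in angle brackets.
        guard let target = parts.first,
              target.hasPrefix("<"), target.hasSuffix(">") else { continue }
        // Any remaining parameter may declare the relation type.
        let isCanonical = parts.dropFirst().contains { param in
            param.replacingOccurrences(of: "\"", with: "").lowercased() == "rel=canonical"
        }
        if isCanonical {
            return String(target.dropFirst().dropLast())
        }
    }
    return nil
}
```

A client that receives such a header for github.com/apple/swift-nio could compare the returned target with the declared URL and, if they differ, report the move to the user.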

Go uses URIs to identify modules, and that doesn't seem to have been an issue for artifact repositories.

https://www.jfrog.com/confluence/display/JFROG/Go+Registry

https://search.gocenter.io

In response to feedback from the Amazon CodeArtifact team in a previous thread, I created a proof-of-concept that implements Swift package registry support using the AWS SDK:

For what it's worth, newer languages like Go, Rust, and Deno use URLs to identify packages.

This is a bit of a subjective claim. I think you could say the same about URLs — especially since that's how Swift already specifies external dependencies.

Sorry, but I don't understand why an opaque identifier is unique or unambiguous, but a URI isn't. Can you provide an example of a situation that works for opaque identifiers but not URIs?

I haven't worked extensively with Maven, but my impression is that this naming scheme is not always unambiguous. For instance, Google's Best Practices for Java Libraries guide lists a few examples of naming conventions across different projects:

Examples in open source

Case 1 - Keep Java package name and Maven ID

Guava

Hibernate

Joda Time

Case 2 - Keep Java package name, rename Maven ID

guava vs guava-jdk5

This technically wasn't a new major version, but it is an example of case 2
that has caused a lot of problems.

javax.servlet:javax.servlet-api:3.1.0 vs javax.servlet:servlet-api:2.5

Case 4 - Rename both Java package and Maven ID

Square has established this approach as a policy for its Java libraries.

OkHttp (com.squareup.okhttp -> com.squareup.okhttp3)

Retrofit (com.squareup.retrofit -> com.squareup.retrofit2)

Apache Commons Lang (org.apache.commons.lang -> org.apache.commons.lang3)

RxJava (rx (version 1.x) -> io.reactivex (version 2.x))

JDOM (org.jdom -> org.jdom2)

jdeferred (org.jdeferred -> org.jdeferred2)

Case 5 - Bundle old and new packages in the existing Maven ID

JUnit (junit.framework (versions 1.x-3.x) -> org.junit (version 4))

I think this is actually an anti-feature.

In our off-thread discussions about this, you've used the example of moving from GitLab to GitHub (or vice-versa). If you believe that these services are interchangeable, this is an innocuous change. However, you might feel differently if a package moves to a service you're not familiar with, perhaps hosted in another country, say China or Russia — or indeed to the US. This speaks to my primary objection to using an opaque package identifier.

Incorporating external dependencies is, fundamentally, a matter of trust.

With URI-based package identifiers, the model is simple: A user says "I trust the package hosted on GitHub.com located at the path /mona/LinkedList, and its external dependencies." Addressability plays an important role in trust as well. If your project depends on [github.com/mona/LinkedList](http://github.com/mona/LinkedList), you can go to that URL and find the source history.

My understanding of the alternative you've described has a very different trust model. A user not only has to decide what packages to trust, but also what registries to trust, and in what order. The package "org.swift.swift-nio" may appear trustworthy, but lacking fundamental addressability, it's unclear how this can be independently verified.

Our proposal for identifying packages by URI satisfies both of the stated requirements.

We've described in detail how everything works, in the proposal, the service interface description, and OpenAPI specification. We've provided a working implementation and a benchmark harness that anyone can use to try out registries yourself, today. We wrote a reference implementation for the registry server and a proof-of-concept of how artifact repositories can add Swift registry support.

If you feel strongly about an alternative solution — one that we considered earlier in the design process, but ultimately rejected — then I think it'd be very helpful to get into specifics before concluding that to be a better option.

This would not be a trivial drop-in change. Content negotiation and HTTP semantics are a core component of this proposal, and moving from URIs to opaque identifiers would require substantial modification to both the server specification and the client implementation.

NeoNacho · January 15, 2021, 6:26pm

In my mind, this is fundamentally at odds with

We can either have a system where we get implicit trust from the location of a package as specified in the manifest or allow unbinding identity and location as you described, but we can't have both.

johannesweiss · January 15, 2021, 6:28pm

mattt:

Incorporating external dependencies is, fundamentally, a matter of trust.

With URI-based package identifiers, the model is simple: A user says "I trust the package hosted on GitHub.com located at the path /mona/LinkedList, and its external dependencies." Addressability plays an important role in trust as well. If your project depends on [github.com/mona/LinkedList](http://github.com/mona/LinkedList), you can go to that URL and find the source history.

My understanding of the alternative you've described has a very different trust model. A user not only has to decide what packages to trust, but also what registries to trust, and in what order. The package "org.swift.swift-nio" may appear trustworthy, but lacking fundamental addressability, it's unclear how this can be independently verified.

Sorry, I'm just jumping in on this one. You claim that the trust model for URI-based identifiers is simple. It's simple if and only if there is no mirroring feature. If you have configured a mirror which, say, mirrors https://github.com/apple/swift-nio to https://evil.corp/super-fast-stuff/swift-nio, you now have to trust your mirror.

You may say that people carefully audit the configured mirrors but I could make the same argument with regards to package registries.

In fact in a corporate setting you'll likely not be able to reach github.com because relying on external services to build your code means that you have a dependency on github.com always working and not rate-limiting you. That means in a corporate setting, I'll have to configure a mirror for every package I use anywhere. This will become a very long list and it could be easy to miss if a malicious mirror is configured in this list.

I recognise that this is a guess but I'd think that the list of trusted registries will likely be shorter (and more audited) than the list of configured mirror packages. Of course this depends on the mirror configuration, we could envision a mirror configuration that maps anything to https://my-corp.mirror/swift-packages/$PACKAGE_BASENAME or so.

mattt · January 15, 2021, 6:33pm

I disagree. This is a delegation of trust from the package maintainer to the registry.

Mirrors are used for one-offs, like pointing to forks. In the case of an internal network, you'd use an intermediate registry proxy.

mattt:

Intermediate registry proxies

By default, the identity of the package is the same as its location. Whether a package is declared with a URL of https://github.com/mona/linkedlist or git@github.com:mona/linkedlist.git , Swift Package Manager will — unless configured otherwise — attempt to fetch that dependency by consulting github.com , which may respond with a Git repository or a source archive (or perhaps 404 Not Found ).

A user can currently specify an alternate location for a package by setting a [dependency mirror][SE-0219] for that package's URL.
$ swift package config set-mirror \
    --original-url https://github.com/mona/linkedlist \
    --mirror-url https://github.com/octocorp/swiftlinkedlist
Dependency mirroring allows for package dependencies to be rerouted on an individual basis. However, this approach doesn't scale well for large numbers of dependencies.

Swift Package Manager could implement a complementary feature that allows users to specify one or more registry proxy URLs that would be consulted (in order) when resolving dependencies through the package registry interface.

For example, a build server that doesn't allow external network connections may specify an internal registry URL to manage all package dependency requests.
$ swift package config set-registry-proxy https://internal.example.com/
When one or more proxy URLs are configured in this way, resolving a package dependency with the URL https://github.com/mona/linkedlist results in a GET request to https://internal.example.com/github.com/mona/linkedlist .
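The mapping in that example can be sketched as a small function. This is only an illustration of the URL shape described above (proxy base, then the package's host, then its path); the name proxiedURL(forPackage:via:) is hypothetical, and the sketch only handles URLs that have a scheme and host.

```swift
import Foundation

/// Builds the URL a registry proxy would be asked for, following the
/// example in the text: proxy base + package host + package path.
/// Hypothetical helper; not part of Swift Package Manager.
func proxiedURL(forPackage package: String, via proxy: String) -> String? {
    guard let components = URLComponents(string: package),
          let host = components.host else { return nil }
    // Avoid a doubled slash when the proxy URL ends with "/".
    let base = proxy.hasSuffix("/") ? String(proxy.dropLast()) : proxy
    return "\(base)/\(host)\(components.path)"
}

let requested = proxiedURL(
    forPackage: "https://github.com/mona/linkedlist",
    via: "https://internal.example.com/"
)
// requested == "https://internal.example.com/github.com/mona/linkedlist"
```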

A registry proxy decouples package identity from package location entirely, which could unlock a variety of compelling use cases:

Geographic colocation : Developers working under adverse networking conditions can host a mirror of official package sources on a nearby network.

Policy enforcement : A corporate network can enforce quality or licensing standards, so that only approved packages are available.

Auditing : A registry may analyze or meter access to packages for the purposes of ranking popularity or charging licensing fees.

NeoNacho · January 15, 2021, 6:39pm

Doesn't this mean you agree? :) If there's a delegation of trust, there is no longer the implicit trust of seeing a literal location in the manifest, which to me sounds like the exact same trust model we would get from opaque identifiers. What am I missing here?

johannesweiss · January 15, 2021, 6:41pm

Which I have to trust?

mattt · January 15, 2021, 6:50pm

With URIs, the user makes a single decision to trust a package maintainer. With opaque identifiers, you also have to specify which registries to trust.

Yes? I'm not sure why you'd set an intermediate registry proxy or a mirror on the client that you don't trust.

abertelrud · January 15, 2021, 6:52pm

Would the package maintainer not need to specify this on the original host with the old name, though? In that sense there is still a lock-in.

One example is in regards to the different transport mechanisms. One concrete example is a non-public repository server that vends URLs of this form:

ssh://git@server.example.com/~somebody/repository.git
https://server.example.com/scm/~somebody/repository.git

The added scm component means that there's no mechanical transformation from one to the other. In previous replies in this thread there have been other examples of the ambiguity of interpreting URLs.
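A short sketch shows why these two URLs resist mechanical matching. The normalization below is an assumption made for illustration (drop scheme, user, and a trailing ".git"), not the rule Swift Package Manager or the proposal uses, and naiveIdentity is a hypothetical name.

```swift
import Foundation

/// Naive normalization, assumed for illustration: keep host and path,
/// drop the scheme, any user info, and a trailing ".git".
func naiveIdentity(_ location: String) -> String? {
    guard let components = URLComponents(string: location),
          let host = components.host else { return nil }
    var path = components.path
    if path.hasSuffix(".git") { path.removeLast(4) }
    return (host + path).lowercased()
}

let sshIdentity = naiveIdentity("ssh://git@server.example.com/~somebody/repository.git")
let httpsIdentity = naiveIdentity("https://server.example.com/scm/~somebody/repository.git")
// sshIdentity   == "server.example.com/~somebody/repository"
// httpsIdentity == "server.example.com/scm/~somebody/repository"
// The extra "scm" path component keeps the two from comparing equal.
```

Any rule that would unify them needs server-specific knowledge, such as knowing that this host prefixes HTTPS paths with "scm".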

johannesweiss · January 15, 2021, 6:56pm

I really struggle to see why these two are different:

setting an intermediate proxy registry I don't trust
setting a package registry I don't trust

mattt · January 15, 2021, 7:24pm

If I'm a maintainer of a package hosted on GitHub.com and I want to move to another hosting provider or host it myself, I can tell GitHub to set the new location as the canonical URL for that package.

abertelrud:

One example is in regards to the different transport mechanisms. One concrete example is a non-public repository server that vends URLs of this form:
ssh://git@server.example.com/~somebody/repository.git
https://server.example.com/scm/~somebody/repository.git
The added scm component means that there's no mechanical transformation from one to the other. In previous replies in this thread there have been other examples of the ambiguity of interpreting URLs.

The same is true of opaque identifiers, too. The same package could be published as com.example.repository, com.example.repository2, and org.example.mirror.repository. This is what I was trying to convey before: in an open ecosystem, no identity scheme can prevent the same package from being made available under different names.

Maybe I'm misunderstanding the meaning of "unique and unambiguous"?

Correct me if I'm wrong, but my understanding of your original point was that a short list of registries (in an opaque identifier scheme) would be easier to audit than the possibly long list of mirrors necessary to make resolution work in a corporate environment (in a URI-based scheme). My point in response was that you wouldn't have a long list like this, and instead delegate all package routing to an intermediate registry proxy.

Does that make sense?

johannesweiss · January 15, 2021, 7:34pm

It does make sense. I shouldn't have mentioned the long list vs. short list bit; it just makes things more complex. And you also pointed out that even with the URI-based package resolution we can have just a short list (the intermediate registry proxy).

The core of my argument however is: Much like with opaque identifiers, in the URI-based scheme I cannot just trust what I see in the URI. I need to also trust that everything is set up correctly and I didn't configure a malicious registry/intermediate registry proxy.

So from a trust point of view I see no difference: URIs may look like I know where the code comes from but I don't actually know unless I audit the full configuration too. The same applies to opaque identifiers but at least it's more obvious that this is the case.

Which would mean that we'd be relying on the source code host (e.g. GitHub) having implemented that feature as well as allowing the package maintainer to use it.

mattt · January 15, 2021, 7:51pm

Here's another way you could say it:

A URL is routed directly unless you change the configuration, either by setting a mirror or setting an intermediate registry. This is the same as how things work now with external dependencies and mirrors.
An opaque identifier is meaningless on its own; how that's routed is entirely dependent on configuration.
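The URL case can be sketched as a routing decision. The types and field names here are hypothetical, and the precedence (a per-package mirror first, then the first configured proxy, otherwise the URL itself) is an assumption for illustration rather than something the proposal specifies. Falling through to later proxies would depend on network responses and is left out.

```swift
/// Hypothetical client configuration; field names are illustrative only.
struct ResolverConfig {
    var mirrors: [String: String] = [:]   // original URL -> mirror URL
    var registryProxies: [String] = []    // consulted in order
}

/// Where a request for a package URL ends up.
enum Route: Equatable {
    case direct(String)
    case mirror(String)
    case proxy(String)
}

/// With an empty configuration, a URL routes to itself.
func route(for packageURL: String, using config: ResolverConfig) -> Route {
    if let mirror = config.mirrors[packageURL] { return .mirror(mirror) }
    if let proxy = config.registryProxies.first { return .proxy(proxy) }
    return .direct(packageURL)
}

let url = "https://github.com/mona/linkedlist"
let unconfigured = route(for: url, using: ResolverConfig())
let proxied = route(
    for: url,
    using: ResolverConfig(registryProxies: ["https://internal.example.com/"])
)
// unconfigured == .direct("https://github.com/mona/linkedlist")
// proxied      == .proxy("https://internal.example.com/")
```

An opaque identifier has no equivalent of the .direct case: without configuration there is nothing to route to.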

I agree that in both cases, configuration can confound expectations. To that end, I started to write up a swift package discover command that would help end users understand how packages are resolved — as well as other useful information, like time spent downloading and size on disk.

Yes, correct.

Max_Desiatov · January 15, 2021, 8:28pm

I hope this doesn't have to be so tied to Git hosting, another way to implement these redirects is to rely on .well-known files.

Overall, I feel like going away from URLs to some arbitrary identifiers not only would be a significantly breaking change, but also would abandon the trust of the web that has already been established through DNS and TLS.

Anyone can create a com.test.testpackage and introduce a naming collision that way. In all seriousness, I would expect this to become a problem instantly: some tutorial will hardcode a package ID, and people will copy-paste it and host their packages without changing that ID. Or swift package init will generate a placeholder ID, which a significant number of people will never bother changing.

At the same time, if you trust your DNS and certificate authorities (which I think most users do), the problem with naming collisions goes away. Only one package can be hosted on test.com/testpackage, and infrastructure for redirects on the web already exists. I think it's indicative that recently built ecosystems like Go, Rust and especially Deno (which could've just relied on npm) have chosen to use URLs instead of arbitrary identifiers.

I disagree that opaque package identifiers are easy to reason about and have a short learning curve. Application bundle IDs used on iOS and macOS, which are arguably just opaque app identifiers, are a great example. "What is a bundle ID, which one should I choose, and why does it look like a domain name in reverse?" are a few very frequently asked questions I hear from beginner developers in Apple's ecosystem. Honestly, I still don't have good solid answers to most of those questions. The headaches that conflicting bundle IDs cause are major; just have a look at what macOS Catalyst developers are dealing with. If solutions to these problems weren't found at Apple's massive scale, I strongly doubt they will be found if we have to deal with them in SwiftPM.

Mind that App Store bundle IDs are managed in a centralized way by Apple. If SwiftPM is to stay decentralized, how will that not exacerbate the issues?

Jon_Shier · January 15, 2021, 8:40pm

Exactly. Don't unique arbitrary identifiers require a central authority by definition? It doesn't really matter if it's just a name or a reverse DNS id, if you can claim it arbitrarily, you need a service that guarantees ownership over that identifier. This is what the Apple ecosystem has with CocoaPods, and it works well, but doesn't seem like the model Apple wants for SPM.

Max_Desiatov · January 15, 2021, 8:49pm

I would argue this did not work well in a lot of cases. I've lost count of how many times I forgot to upload a pod spec after tagging a new version of a library. There's also the fact that hosting forked libraries on CocoaPods is particularly hard, as everything lives in the global namespace. You can work around that with separate spec repositories, which essentially becomes an ad-hoc decentralized solution. No such problems with SwiftPM so far.

The web is already decentralized, and has DNS for naming resolution, and there are enough people working on issues of trust. I hope SwiftPM can build on top of those. As they say, "don't roll your own cryptography, as your own version will never be as good as what is already out there", I hope we can avoid rolling out our own version of DNS and certificate authorities.

Jon_Shier · January 15, 2021, 9:00pm

I find this to be a benefit. Under a decentralized system like the one proposed, no matter the identifier, I'm not really looking forward to people being able to fork Alamofire onto GitLab and still call it Alamofire.

Max_Desiatov · January 15, 2021, 9:04pm

The thing though is that the library is currently called github.com/Alamofire/Alamofire, and on GitLab it would be called gitlab.com/fork-user/Alamofire. How would this be different if, under a proposed centralized system with opaque identifiers, you have org.alamofire.alamofire, and someone forks it and calls it org.a1amofire.alamofire? Or if I upload my fork to CocoaPods as A1amofire?

Jon_Shier · January 15, 2021, 9:08pm

It's different because you've been forced to change the name. Additionally, a centralized authority could have a process in place to handle complaints of ambiguous or conflicting identifiers, similar to how Apple would handle someone trying to publish a Facebo0k app on the store. Decentralization really hurts in this regard.

My disagreement with this sort of decentralized system is beside the point of this thread though.

NeoNacho · January 15, 2021, 9:13pm

This is also a breaking change, though, since today the last path component is used for package identity. I feel like we keep on forgetting about this in the discussion, just because we keep using URLs does not mean we have backwards compatibility.
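As a rough sketch of that existing behavior, identity can be approximated as the lowercased last path component with any ".git" suffix removed. The function lastComponentIdentity is a hypothetical stand-in, not Swift Package Manager's actual implementation, and the fork URL is made up for the example.

```swift
/// Approximates identity as the last path component, lowercased, with a
/// trailing ".git" removed. An illustrative stand-in, not SwiftPM's code.
func lastComponentIdentity(_ location: String) -> String {
    var name = location.split(separator: "/").last.map { String($0) } ?? location
    if name.hasSuffix(".git") { name.removeLast(4) }
    return name.lowercased()
}

let upstream = lastComponentIdentity("https://github.com/Alamofire/Alamofire")
let fork = lastComponentIdentity("https://gitlab.com/fork-user/Alamofire.git")
// upstream == "alamofire"
// fork     == "alamofire"
```

Under this rule the two locations collapse into one identity, whereas treating the full URL as the identity would keep them distinct. That difference is the breaking change.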

Jon_Shier · January 15, 2021, 9:14pm

I guess, fundamentally, my opposition to both the URL and any other unique identifier is that it doesn't change the module name that the user actually interacts with on a regular basis. AFAICT neither proposal introduces any bundle name disambiguation to the language either, meaning it's perfectly possible for users to use packages that aren't really the module they say they are.