URLs as Swift Package Identifiers

Karl · January 27, 2021, 3:43am

No it isn't. Actually, it addresses another big issue with this proposal: for all its talk of using tried and true technologies of the web, it actually has a totally different security model.

When I visit apple.com, how do I know the content I receive is really from Apple? Because they have a certificate, linked to that domain, verified by a third party in my chain of trust, and their private key is used to encrypt the content they send me. The origin, which is critical to the web's security model, would only include the components (https, apple.com, 443).

The package identities in this proposal are entirely different: they are tuples of the form (github.com, apple/swift-nio). They are inextricably bound to one particular registry or hosting provider, and not at all bound to the package author in any way we can independently verify.
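To make the contrast concrete, here is a minimal sketch that derives both tuples from URLs. The type and function names (`WebOrigin`, `HostedIdentity`, `webOrigin(of:)`, `hostedIdentity(of:)`) are hypothetical and appear nowhere in the proposal; the default-port table covers only http and https.

```swift
import Foundation

// Hypothetical types for illustration; neither appears in the proposal.
struct WebOrigin: Equatable {
    let scheme: String
    let host: String
    let port: Int
}

struct HostedIdentity: Equatable {
    let host: String   // the hosting provider, e.g. "github.com"
    let path: String   // derived from the URL path, e.g. "apple/swift-nio"
}

// A web origin is (scheme, host, port); the path plays no part in it.
func webOrigin(of urlString: String) -> WebOrigin? {
    guard let c = URLComponents(string: urlString),
          let scheme = c.scheme?.lowercased(),
          let host = c.host?.lowercased() else { return nil }
    let defaultPorts = ["http": 80, "https": 443]
    guard let port = c.port ?? defaultPorts[scheme] else { return nil }
    return WebOrigin(scheme: scheme, host: host, port: port)
}

// The tuple described in the post: hosting provider plus a string taken from the path.
func hostedIdentity(of urlString: String) -> HostedIdentity? {
    guard let c = URLComponents(string: urlString),
          let host = c.host?.lowercased() else { return nil }
    var path = c.path.trimmingCharacters(in: CharacterSet(charactersIn: "/"))
    if path.hasSuffix(".git") { path.removeLast(4) }
    guard !path.isEmpty else { return nil }
    return HostedIdentity(host: host, path: path)
}
```

Note that the origin names whoever controls the domain, while the second tuple names the hosting provider and leaves the author as an unverified path string.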

I don't know how you arrived at these requirements, or any kind of general dissatisfaction with reverse-DNS not being opaque enough, from tomerd's post. In fact, I read the opposite:

The concern is that package identifiers should be opaque with regards to location. Reverse DNS allows us to have that opacity while allowing the package author's identity to be independently verified.

Jon_Shier · January 27, 2021, 3:58am

You're suggesting using reverse-DNS identifiers to look up a domain in the actual DNS system and then using that information to look up a certificate. How is that not just another URL? And how is that opaque? How is that not a location?

There are no tuples in the original proposal. And you can quite easily independently verify the ownership of that information, it just relies on the identity provided by the host rather than some combination of random domain ownership and certificates. I would trust the identity provided by GitHub far more than what you've proposed. But in the end, it's all the same information, you've just rearranged the parts and called it different. And again, it puts the onus on the publisher to somehow establish identity and on the user to properly see it.

This is rather off topic, but this isn't how you know apple.com is actually Apple's website. That's only established by DNS ownership records. If someone were able to take control of those records, they could easily provide their own certificates that look just as legitimate as Apple's do now. All a certificate proves is that someone was able to get it signed for a domain by a root authority.

Karl · January 27, 2021, 4:28am

Expanding the part of tomerd’s post that I quoted above (it’s the first paragraph):

The question is about identities which carry location information vs. those which do not. The subsequent paragraphs, as I read them, make clear that “location” refers to the registry (e.g. its hostname), not the author’s identity.

Reverse DNS names are not opaque by definition - just by describing them as reverse DNS, you impose some structure which can be parsed. But a domain name linked to the author is still opaque with regards to registry hostname (or “independent of registry hostname” might be clearer).

Semantic tuples, of (registry hostname, string derived from URL path).

Indeed - the point I was making, and that you have reiterated, is that existing web infrastructure links an entity’s identity to its domain, the certificate is issued for a domain, etc. This proposal manages identity without using the author’s domain, but instead embedding the registry’s domain into the package ID and forever delegating the issue to that registry.

One could argue about how “decentralised” that really is (as I mentioned before, while on some level it is decentralised, it comes with severe privacy drawbacks which limit how effective decentralisation can be in practice).

John_McCall · January 27, 2021, 5:19am

The core team considers it unacceptable to have the repository URL be the identifier of a package. That is the current state, and it is already tangibly causing problems for the package ecosystem. The solution needs to be able to accommodate the same logical package being found at different locations. It also needs to include some sort of non-location-based namespacing that will discourage artificial package conflicts.

Beyond that, we are not trying to constrain the pitch to favor any particular approach. In particular, we are neither insisting on nor even suggesting a centralized registry of either packages or domains. If you think a non-location-based package identifier necessarily requires such a centralized registry, or is substantially flawed without one, I think that a thoughtful post laying out that case would be very interesting input for the pitch.

mmarston · January 27, 2021, 8:22pm

I think the community is having a hard time coming up with a non-location-based package identifier scheme that doesn't involve a central registry as the authority/arbiter of what parties have claimed a particular package identifier and are authorized to publish packages with that identifier.

One option for non-location-based package identifiers is that package authors can choose an arbitrary name as a package identifier, similar to the status quo with CocoaPods, crates.io, PyPI and (unscoped) npm package names. Examples of such names are Alamofire, tokio, jupyter, and lodash. If this approach is taken then how does the community avoid conflicts due to duplicate package names without a central registry?

Assume this approach was used with multiple registries. When I want to create a new package, do I have to check each registry to be sure the name is not already in use? And once I pick a name, should I go register the name on each registry? Or should the registries be federated, such that once a name is registered in one registry the other registries won't allow that name to be used to publish a package in their registry?

Instead of using an arbitrary package name as the package identifier, we could add a namespace component, similar to npm scopes. This allows package creators to come up with new names within their namespace without any conflict. Also, once an organization or project has reserved a namespace, it also allows developers to know that when they see a new package in that namespace that it came from that organization. For example, new AWS CDK npm packages published by AWS are in the @aws-cdk scope. Anyone else can publish npm packages containing CDK constructs, such as cdk-datadog-integration but it is clear that package isn't published by AWS because it isn't in the @aws-cdk scope.

Another benefit of namespaces is that an organization can reserve a namespace and then publish packages within that namespace in a private registry, and they don't have to worry about another party publishing a public package with the same package identifier.
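As a rough illustration of namespace reservation, the sketch below models a single registry that records who owns each scope and refuses to let anyone else publish into it. `ScopedName` and `NamespaceRegistry` are hypothetical names, not the API of npm or any real registry, and the sketch assumes one table that everyone consults, which is the very thing the following paragraphs question.

```swift
// Hypothetical in-memory model of namespace reservation; not a real registry API.
struct ScopedName: Hashable {
    let scope: String
    let name: String

    // Parses the npm-style form "@scope/name"; unscoped names are rejected.
    init?(_ text: String) {
        guard text.hasPrefix("@") else { return nil }
        let parts = text.dropFirst().split(separator: "/", omittingEmptySubsequences: false)
        guard parts.count == 2, !parts[0].isEmpty, !parts[1].isEmpty else { return nil }
        scope = String(parts[0])
        name = String(parts[1])
    }
}

struct NamespaceRegistry {
    var owners: [String: String] = [:]   // scope -> owner

    // First claim wins; a later claim by a different party is rejected.
    mutating func reserve(scope: String, for owner: String) -> Bool {
        if let existing = owners[scope] { return existing == owner }
        owners[scope] = owner
        return true
    }

    // Only the owner of a scope may publish packages inside it.
    func mayPublish(_ package: ScopedName, as publisher: String) -> Bool {
        owners[package.scope] == publisher
    }
}
```

With this model, a package in the `aws-cdk` scope can only come from whoever reserved that scope, while an unscoped name such as `cdk-datadog-integration` carries no such guarantee.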

Namespaces provide value in a package ecosystem, but if namespaces are arbitrarily chosen then the question of name conflicts just shifts to namespaces. How does the community avoid namespace conflicts without a central registry?

The only suggestion that I've seen to avoid arbitrarily chosen namespaces is to use a namespace that is based on DNS and have the registry administrator perform some checks, whether automated or manual. For example, Maven uses reverse-DNS, and the two main public Maven repositories (the Central Repository and JCenter) both have a review and approval process before authorizing a publisher to publish packages in a specific namespace.

Does the community have any alternative suggestions?

Some questions we should consider for any package identifier scheme:

As a package creator, how do I choose a unique package identifier?
As a SwiftPM developer, how do I avoid conflicts due to duplicate package identifiers in my dependency graph?
As a community, how do we prevent bad actors from creating a duplicate package identifier or typo-squatting (using a package identifier that looks like a well-known package)?
As a company that uses a private registry, how do we choose private package identifiers that will never conflict with package identifiers someone else publishes to a public registry? (unintentionally or maliciously)

michelf · January 28, 2021, 12:48am

Here's an idea for an alternative scheme. It's surely incomplete and I'm not sure this is where we should be going, but I think it's a possibility worth mentioning.

Use the hash of a public key as identity.

Git can sign commits and tags, so you can check if the fetched commit or tag matches the same public key as the one hashed in the package's identity. A package identity would then be like:

// <public-key-hash>.reverse.dns.packagename
8v7nJA0Pam712FcJA91k.com.example.PackageName

We could also allow the first part (the hash) to be omitted, then use a bit of DNS+HTTP magic to find the official public key hash of example.com. So if we have this:

com.example.PackageName

we then fetch https://example.com/.well-known/spm_keyhash to get the public key hash representing the official identity, and then add it back:

8v7nJA0Pam712FcJA91k.com.example.PackageName

If your private key is lost or stolen and you are using a domain, then you can update the .well-known/spm_keyhash file and everyone will shortly be using the new key.
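A minimal sketch of how such identifiers might be parsed, under loudly stated assumptions: the post does not say how to tell the long form from the short form, so this sketch guesses that a leading label of exactly 20 characters (the length of the example hash) is the key hash, and that the last label is the package name. `KeyedIdentity`, `parseIdentity`, and `keyHashURL` are hypothetical names.

```swift
import Foundation

// Hypothetical parser for the identifier shape sketched in the post.
struct KeyedIdentity: Equatable {
    let keyHash: String?      // nil for the short (DNS-only) form
    let domain: String        // un-reversed, e.g. "example.com"
    let packageName: String
}

// Assumption: a leading label of this length is the key hash.
// A real scheme would need an unambiguous marker instead.
let assumedHashLength = 20

func parseIdentity(_ text: String) -> KeyedIdentity? {
    var labels = text.split(separator: ".", omittingEmptySubsequences: false).map { String($0) }
    guard !labels.contains("") else { return nil }

    var keyHash: String? = nil
    if let first = labels.first, first.count == assumedHashLength {
        keyHash = first
        labels.removeFirst()
    }
    // Require at least two domain labels plus the package name.
    guard labels.count >= 3, let packageName = labels.popLast() else { return nil }
    let domain = labels.reversed().joined(separator: ".")
    return KeyedIdentity(keyHash: keyHash, domain: domain, packageName: packageName)
}

// Where the short form would look up the official key hash.
func keyHashURL(for identity: KeyedIdentity) -> URL? {
    URL(string: "https://\(identity.domain)/.well-known/spm_keyhash")
}
```

The sketch only builds the lookup URL; fetching it and caching the result are left out.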

There's a big shift in this scheme: you don't have to trust the registry as much as before. Anyone can trivially validate the authenticity of a package.

On the other hand, this system might require too many changes to be viable.

Q&A:

Why a public key hash and not the public key?
Because it can be kept short. Cryptocurrencies also demonstrated that this is secure.
Where do you find the public key if you only have the hash?
The public key is part of the signature of a commit or tag.

As a package creator, how do I choose a unique package identifier?
Your public key hash is at the start of the identifier. Add a reverse domain name (optional). Add your package name.
As a SwiftPM developer, how do I avoid conflicts due to duplicate package identifiers in my dependency graph?
If it does happen, then the same person created two packages with the same name. Unless a private key has been stolen or reused, in which case a new identity should be created and the .well-known/spm_keyhash file on the HTTP server should be updated by whoever this package belongs to.
As a community, how do we prevent bad actors from creating a duplicate package identifier or typo-squatting (using a package identifier that looks like a well-known package)?
It's hard to forge a public key hash that validates and resembles another one. But the shorter form (reverse DNS only) might be slightly vulnerable if someone manages to register a lookalike domain.
As a company that uses a private registry, how do we choose private package identifiers that will never conflict with package identifiers someone else publishes to a public registry? (unintentionally or maliciously)
Use your own public/private key pair.
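The core check behind these answers can be sketched as a comparison between the hash embedded in the identity and the hash of the key that signed the commit or tag. The post names neither a hash function nor an encoding, so the hash is injected as a parameter here; `signerMatches` and `toyHash` are hypothetical, and `toyHash` is a deliberately insecure stand-in used only so the example runs. Verifying the signature itself is out of scope for this sketch.

```swift
// Hypothetical check: does the signing key match the hash embedded in the identity?
// `hash` is injected because the post does not specify a hash function or encoding.
func signerMatches(identityKeyHash: String,
                   signerPublicKey: [UInt8],
                   hash: ([UInt8]) -> String) -> Bool {
    hash(signerPublicKey) == identityKeyHash
}

// Toy stand-in for a cryptographic hash, for demonstration only. NOT secure.
func toyHash(_ bytes: [UInt8]) -> String {
    let sum = bytes.reduce(UInt32(0)) { ($0 &* 31) &+ UInt32($1) }
    return String(sum, radix: 36)
}
```

A real implementation would swap in a proper cryptographic hash and would first confirm that the signature on the tag verifies against the public key.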

drewster99 · January 29, 2021, 7:02am

I don’t think URLs as identifiers is a great plan.

What about the same package hosted multiple places?

You could also have the site for a URL just go dark. Not nicely giving out redirects — just dark.

The real issues a registry system needs to solve are:

1- discoverability
2- redundancy
3- code authenticity

Having a registry (or better, registries) is a good thing. I’d like to see multiple registries where anyone who conforms to the spec can host one, but I’d also like to see a bespoke one, perhaps run by a community group, that is carefully curated, with rules and checks-and-balances designed to promote high quality. I have no idea how that would happen.

What about this—

Maybe package identifiers similar to bundle identifiers, starting with reverse DNS. Add a system of DNS entries that describe servers eligible to host packages with package ID prefixes matching the reverse DNS, and a public key.

In that way, you could specify a package ID rather than a URL. DNS lookup finds valid servers. Servers vend specific package URLs. Packages are signed with private key.

Or maybe DNS can give either a URL (URLs) for the package and/or delegate that ability elsewhere.

Then you can say, these 2 registries can host my package repo, but I still have my own private key (not the registries), and anyone downloading my package can find multiple places to get it from, and they can know if I signed it, and if their download source is authorized.

Something like that?
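One way to picture this idea is as a record, published by the domain owner, that lists which hosts may serve packages under a given reverse-DNS prefix. The record shape below is an assumption (the post does not define a DNS record format), and `PackageHostingRecord`, `SourceCheck`, and `check` are hypothetical names. Signature verification against the published key is not shown.

```swift
// Hypothetical record shape; the post does not define a DNS record format.
struct PackageHostingRecord {
    let idPrefix: String              // reverse-DNS prefix, e.g. "com.example"
    let authorizedHosts: Set<String>  // servers allowed to host packages under the prefix
    let publicKeyFingerprint: String  // the author's key, independent of any registry
}

enum SourceCheck: Equatable {
    case authorized
    case prefixMismatch
    case hostNotAuthorized
}

func check(packageID: String,
           downloadHost: String,
           against record: PackageHostingRecord) -> SourceCheck {
    // Match on whole labels so "com.example" does not cover "com.examplefoo.Pkg".
    guard packageID.hasPrefix(record.idPrefix + ".") else { return .prefixMismatch }
    guard record.authorizedHosts.contains(downloadHost.lowercased()) else {
        return .hostNotAuthorized
    }
    return .authorized
}
```

Several registries can appear in `authorizedHosts` at once, which is what lets a downloader pick any of them while still knowing the source was approved by the domain owner.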

Max_Desiatov · January 29, 2021, 8:00am

I don't think that the presence of a registry would make this any more reliable. It could go dark with the same effect to the end user.

Diggory · January 29, 2021, 11:30am

It seems to me (a hobbyist idiot) that reverse DNS is a proven workable system. How many collisions are there in the real world for bundle IDs?

If there were a collision (say due to a malicious actor), perhaps have a fallback where the un-reversed DNS is used to check the canonical URL via a file on a web server at that domain. E.g. in a .well-known/spm.swift file.

Max_Desiatov · January 29, 2021, 11:46am

I personally experience this constantly. Clone any open-source iOS app from GitHub and try to build it locally. It will fail because of the collision, and then you need to come up with a new bundle ID again just to build an app locally.

This is the developer experience you get when working with a well-funded well-established commercial App Store, which is at least a decade old at this point. How is this supposed to scale in a chaotic world of Swift packages if it doesn't work really well in a controlled commercial centralized environment of bundle IDs?

Diggory · January 29, 2021, 11:56am

Isn’t that because Apple enforces bundleID uniqueness globally? In the decentralised SPM world, if you own the domain then you would own the reverse DNS ID.

I suppose it’s pushing the authority role to domain registrars. (As I believe was mentioned upthread). Owning a domain is not a difficult task these days (which one is constantly reminded of if one listens to the ads in podcasts).

Max_Desiatov · January 29, 2021, 12:09pm

What's the benefit of reverse DNS then? Why do I have to refer to my package as com.github.MaxDesiatov.XMLCoder, when I can already refer to it as https://github.com/CoreOffice/XMLCoder? Why is there a need for a github.com/.well-known/spm.swift file if there's already a Package.swift file there that already proves the package is valid?

It feels like 150 posts later we're going in circles here, but I still haven't seen a concise explanation for why stripping the https:// part of a URL and rewriting the rest of it backwards while replacing slashes with dots is an improvement?
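For reference, the mechanical transformation being questioned can be sketched in a few lines. `reverseDNSIdentifier(fromURL:)` is a hypothetical helper, not something any participant proposed as an API, and it assumes the path components themselves contain no dots apart from a trailing ".git".

```swift
import Foundation

// Drop the scheme, reverse the host labels, and join path components with dots.
func reverseDNSIdentifier(fromURL urlString: String) -> String? {
    guard let c = URLComponents(string: urlString), let host = c.host else { return nil }
    let hostLabels = host.lowercased().split(separator: ".").reversed().map { String($0) }
    var path = c.path
    if path.hasSuffix(".git") { path.removeLast(4) }
    let pathParts = path.split(separator: "/").map { String($0) }
    guard !pathParts.isEmpty else { return nil }
    return (hostLabels + pathParts).joined(separator: ".")
}
```

Applied to a GitHub URL, the result still starts with `com.github`, so the hosting provider stays embedded in the identifier unless the author uses a domain of their own.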

I don't want to diminish the work of the proponents of the reverse DNS notation, but I feel like I completely misunderstand their argument(s), which range from "what if package hosting goes dark?" to "what if the URL changes?" Somehow the same exact questions applied to registries, and the complexity introduced with uniqueness resolution, are not addressed. What am I missing?

Diggory · January 29, 2021, 12:16pm

You wouldn’t have the well-known file at GitHub.com because that domain doesn’t belong to you. The idea is that for little cost and little complexity you can register your own domain (or use an existing one that you own). Then if the URL where the repo lives changes or becomes unavailable, you change the canonical URL at the domain that you control. It’s sort of decoupling the identity and the location. Again, I’m an idiot and haven’t thought this through fully, but URLs seem fragile to me and unchangeable in cases where you don’t own the domain.
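The decoupling can be pictured as a lookup table that the domain owner controls, mapping a stable package ID to whatever repository URL is current. The post suggests a file such as .well-known/spm.swift but does not define its format, so the table is modelled here as a plain dictionary; `CanonicalTable` and the example URLs are hypothetical.

```swift
// Hypothetical model of what the domain owner would publish:
// a table from stable package ID to the current repository URL.
struct CanonicalTable {
    var locations: [String: String] = [:]

    // Publishing again under the same ID moves the package without renaming it.
    mutating func publish(id: String, at url: String) {
        locations[id] = url
    }

    func resolve(_ id: String) -> String? {
        locations[id]
    }
}

var table = CanonicalTable()
table.publish(id: "com.example.PackageName", at: "https://gitlab.com/example/PackageName")
// The repository moves; dependents keep using the same ID.
table.publish(id: "com.example.PackageName", at: "https://github.com/example/PackageName")
```

After the second `publish`, `resolve` returns the new location while every manifest that refers to `com.example.PackageName` stays unchanged.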

If owning a domain is too high a bar, then you don’t have to, but you run the risk of collisions.

Max_Desiatov · January 29, 2021, 12:22pm

This isn't "a little complexity". Right now I can tell literally anyone, even absolute beginners: push your repository to GitHub (or GitLab for that matter) for free, navigate to it, copy the URL from the browser address bar, paste it here in Package.swift to add a dependency to it, you're done.
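For readers who have not seen it, the workflow described above looks roughly like this in a manifest. The package name `MyApp` and the version requirement are placeholders chosen for illustration.

```swift
// swift-tools-version:5.3
import PackageDescription

let package = Package(
    name: "MyApp",  // hypothetical package name
    dependencies: [
        // The URL copied from the browser is all that identifies the dependency.
        .package(url: "https://github.com/CoreOffice/XMLCoder.git", from: "0.11.0"),  // version is a placeholder
    ],
    targets: [
        .target(
            name: "MyApp",
            dependencies: [.product(name: "XMLCoder", package: "XMLCoder")]
        ),
    ]
)
```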

What fraction of people is going to abandon even the thought of publishing a package if that process requires domain ownership and then deploying a correct .well-known/spm.swift file there? How is this less fragile?

Can you give a real-world example of such collision please? Additionally, how exactly would such collision be avoided with a reverse DNS notation (or other proposed scenarios) then?

Diggory · January 29, 2021, 12:27pm

You make a good point re: complexity.

There are collisions at the moment, because the ID is simply the name. The two potential future directions are urls (in which case collisions would not happen) or reverse DNS (in which case there could be collisions unless there was a system for checking authority).
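A small sketch of how name-only identity produces collisions. It assumes, as an approximation of the behaviour described here, that the identity is the lowercased last path component of the URL with any ".git" suffix removed; the function names and the example URLs are hypothetical.

```swift
// Assumption: identity is the lowercased last path component of the URL, minus ".git".
func nameIdentity(ofURL url: String) -> String {
    var last = url.split(separator: "/").last.map { String($0) } ?? url
    if last.hasSuffix(".git") { last.removeLast(4) }
    return last.lowercased()
}

// Groups URLs by name-only identity and keeps the groups with more than one member.
// Listing the same URL twice would also show up here; deduplicate first if that matters.
func collisions(among urls: [String]) -> [String: [String]] {
    Dictionary(grouping: urls, by: nameIdentity(ofURL:))
        .filter { $0.value.count > 1 }
}
```

Two unrelated repositories named `Utils` and `utils` on different hosts collide under this rule, whereas their full URLs remain distinct.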

Matt_McLaughlin · January 29, 2021, 6:19pm

I wonder if there isn’t a two-tier solution. You’re not wrong that DNS is onerous for the “throw the cool thing I’m working on into GitHub so others can play with it” use case. But are those types of packages ever afforded the level of trust that users of a registry are looking for?

What if we use the DNS/.well-known approach but offer affordances for those “quick but collidy” packages on GitHub?

From a registry perspective you can guarantee identity for some packages and just be transparent that others have a different level of trust. Let the user decide what to do with that.

In practice if an exploratory package gets popular enough the maintainer can go ahead and register his domain name and “officialize” it.

Low barrier to entry at the bottom, high level of trust at the top.

Max_Desiatov · January 29, 2021, 6:28pm

I'd definitely trust GitHub or GitLab and the existing DNS and HTTPS infrastructure more than a random registry that uses a convoluted system of placing files at a specific location and expects people who do this to configure everything properly.

Specifically, what SwiftPM does right now has worked pretty well. I don't remember any examples of malicious packages wreaking havoc in our ecosystem, as opposed to what happened multiple times with NPM, which is a centralized registry that uses opaque identifiers for package identity.

tkrajacic · January 29, 2021, 6:48pm

I don't think the issue is about trust as much as hardcoding the hosting provider. I for example would like to move my package away from GitLab to GitHub, and it should not lose its identity by that. That's the issue as I understand it. And requiring access to the old URL is also a complete deal-breaker as the reason for moving will often exclude that option.

I definitely agree with you that the discussion is moving in circles lately, but a solution is also not trivial.
It feels like it would involve some form of decentralized trust system, yada yada yada…

Max_Desiatov · January 29, 2021, 6:54pm

Mirroring code from GitHub to GitLab is trivial. LLVM was mirroring its code from its own server to GitHub for years. Although it's not a SwiftPM package, it shows how flexible this solution is, as it works for any package ecosystem. But to be fair, if this is presented as a disadvantage of the status quo, what alternative do you propose?

If I'd like to move my code from one registry to another, or from one identifier to another, I'd lose the identity too. What's the benefit then of overhauling the whole system only to land at a similar set of trade-offs, but with increased complexity?

tkrajacic · January 29, 2021, 6:56pm

Hehehe, that's what all the fuss here is about

But I mean you are not wrong. In fact nobody here is. That's the problem
There are many good arguments for all kinds of "sides", yet they all come with drawbacks that make them somewhat unsatisfying.