URLs as Swift Package Identifiers

Reading this very long thread, I still don't understand why this:

com.github.michelf.mypackage

would be more of an issue than this:

github.com/michelf/mypackage

as unique identifier. The domain name system is the central naming authority in both cases.

@mattt correct me if I’m wrong, but it is because github.com is a domain that exists, and GitHub guarantees the path /michelf/mypackage cannot be taken by anyone other than the user michelf.

Hardly, like @John_McCall said. Actually that is my main issue w/ SPM today and the reason I wanted to chime in on this proposal.

I’ve been bitten by that once. Consider the following graph:
A depends on B and C, and B depends on C.
Then you change C’s URL in A’s Package.swift (just the source URL, the package is the same), but not in B’s.
When resolving packages, SPM is lost (or at least was when I last tried).
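
To make that scenario concrete, here is a minimal sketch. It is a hypothetical model of the graph, not SwiftPM's actual resolver, and the example.com URLs are made up. It only shows that keying identity on the source URL counts C twice, while keying on a location-independent name counts it once.

```python
from collections import namedtuple

# Hypothetical dependency record: a package name plus the URL it is fetched from.
Dependency = namedtuple("Dependency", ["name", "url"])

# A depends on B and C; B depends on C.
# A's manifest points C at a new URL, B's manifest still uses the old one.
manifests = {
    "A": [
        Dependency("B", "https://example.com/org/B.git"),
        Dependency("C", "https://example.com/new-home/C.git"),
    ],
    "B": [
        Dependency("C", "https://example.com/old-home/C.git"),
    ],
}

def distinct_packages(manifests, key):
    """Collect the set of identities seen in the graph under a given identity function."""
    return {key(dep) for deps in manifests.values() for dep in deps}

by_url = distinct_packages(manifests, key=lambda dep: dep.url)
by_name = distinct_packages(manifests, key=lambda dep: dep.name)

print(len(by_url))   # 3: C appears under two different URLs
print(len(by_name))  # 2: B and C
```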

That’s why I thought using a reverse-DNS for a package ID was a good idea: collisions are unlikely as long as people respect the convention, and it gives the resolver a way to deduplicate packages whatever the source.

However, I now realize 1/ this is probably out of scope and 2/ there is already the name property on a package definition which should fill this deduplication role.

I think an evolution should be made on SPM, but probably not related to this one.

Because com.github.michelf.mypackage can't route a package without a registry.

You'll run into the same problem with opaque identifiers as well, except that now you'll have to reason with the extra level of indirection created by a registry.

You did not get my point, I think. I understand now that using an opaque identifier for the registry is probably a bad idea, because a central authority would indeed have to decide whether one is allowed to push w/ a given prefix, or perform some other kind of validation.

However, I still think SPM should have a notion of a package ID, which would not depend on the source of the package. This makes it possible to tell whether two packages are the same. In a decentralized environment, there must be something unique on which we can rely to deduplicate packages when solving the package graph. This ID should be defined in the Package.swift file (because there is no central authority, it really cannot be defined anywhere else).

We should be able to use the name of the package for this purpose actually.
However, it is common to have two packages with the same name… that’s why it is commonly thought reverse-DNS is a better identifier.

But this is yet again out of scope from this thread; it does not really concern the registry the way it was constructed in the proposal. It should probably be a proposal on its own.

I'm still trying to catch up with lots of things, so just a couple of points I'd like to add (I don't think they've been mentioned before, but I might have just overlooked them):

This may be problematic. HTTP allows paths to be case-sensitive, and Git's URL-like paths just do whatever the underlying filesystem does. What happens if a service decides to allow users named mona and MoNa? One of them won't be accessible via this interface. Worse: users may write MoNa, paste the URL into their browser and see MoNa's repository, but the package identity and the code that is delivered will secretly be from mona.
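
A minimal sketch of the collision, assuming (hypothetically) that identity is derived by lowercasing the whole location. The host example.com and the repository name are placeholders.

```python
def naive_identity(location):
    """Hypothetical normalization: treat locations as case-insensitive."""
    return location.lower()

# Two distinct users on a host that allows case-sensitive usernames.
mona = naive_identity("example.com/mona/LinkedList")
mixed = naive_identity("example.com/MoNa/LinkedList")

# Both collapse to the same identity, so only one of them can be reached.
collide = (mona == mixed)
print(collide)  # True
```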

It may not be a good design practice to allow case-sensitive usernames or repository names, but I'm not comfortable with SwiftPM restricting how code repository services do their business.

The good thing about defining our own identifiers is that we can invent restrictions like this without impacting any other services.

"too big to fail" may be more like it :sweat_smile:

FWIW, when it comes to URLs, I think that a co-ordinated effort from the big browser vendors to replace them would go down pretty well. There are so many problems with the current format that it really does just need a re-think (the worst one IMO? Percent encoding. Encoded components should have a marker so the client knows it needs to decode them; currently it's just guesswork and fragile gentleman's agreements).

There is a bigger issue here: by tying package identity and location together, it means that I'm giving GitHub some kind of partial ownership of my package identity. Sure, technically I can relocate, but then I'd be considered a new package by the ecosystem and force my users through a more painful migration.

One of the things that I think has been overlooked in this discussion so far is privacy. Even if I relocate to another provider, all requests for my package will continue to go through the old provider. I'm not a lawyer, but I believe that collecting data about who is requesting the packages is legal essentially everywhere as long as it is aggregated and anonymised. On the package author's side, they would presumably need to maintain their user accounts with the old provider and agree to their terms of service and data collection policies - otherwise they'd lose their package identity.

For example, let's say GitHub's terms of service change so that Microsoft can use all of this anonymised data (or maybe they can do that already); it would also mean they get to see statistics about who is requesting AWS's Swift SDK hosted on GitHub. Maybe they see a surge in downloads from France, and that gives them some early competitive insight for Azure that they otherwise wouldn't have had. It would be a lot more difficult for Amazon to relocate that AWS package in a way that avoids GitHub collecting that data at all.

(I'm not accusing GitHub or Microsoft of anything :slightly_smiling_face:, it's just an example of the need for privacy).

I think it should be a requirement that we can not only move packages between providers, but that we can do so while cutting the old provider from the process entirely.

If those are indeed different resources, then the host could redirect those to a case-encoded alternative. For example, here's how Go solves this problem:

To avoid problems when serving from case-sensitive file systems, the <module> and <version> elements are case-encoded, replacing every uppercase letter with an exclamation mark followed by the corresponding lower-case letter: github.com/Azure encodes as github.com/!azure.
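
As a rough illustration of that rule, here is a small sketch of the encoding and its inverse. It is not Go's implementation; it handles ASCII letters only and assumes the input contains no literal exclamation mark.

```python
def case_encode(path):
    """Replace each uppercase ASCII letter with '!' plus its lowercase form."""
    out = []
    for ch in path:
        if "A" <= ch <= "Z":
            out.append("!" + ch.lower())
        else:
            out.append(ch)
    return "".join(out)

def case_decode(encoded):
    """Inverse of case_encode: '!x' becomes 'X'."""
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == "!" and i + 1 < len(encoded):
            out.append(encoded[i + 1].upper())
            i += 2
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)

print(case_encode("github.com/Azure"))  # github.com/!azure
print(case_encode("example.com/MoNa"))  # example.com/!mo!na
print(case_encode("example.com/mona"))  # example.com/mona
```

With this encoding, mona and MoNa stay distinct even on a case-insensitive file system, and the original spelling can be recovered.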

For what it's worth, I don't know of any code hosting services for which this would be a problem.

That would certainly be nice, but I'm afraid this may not be possible. Identity and continuity are social constructs: In actuality, there is no such thing as Alamofire as a single, consistent entity over time, except by our collective agreement that this is the case. Philosophers have been grappling with these problems for a long time, as expressed through thought experiments like the Ship of Theseus and the Morning Star / Evening Star.

Technically, the best solution I've heard was to check the hash of the first commit for a repository. But that can be forged and may produce false negatives (rebasing) or false positives (incompatible forks or templates).

A key part of the core team's SE-0292 review feedback was to explore the question of identity and, more specifically, to resolve the question of identities that carry location information such as URLs, compared to identities that are opaque with regard to location, such as simple strings or reverse-DNS.

The reason our choice of identity is so important is that SwiftPM today is not able to reliably deduplicate packages, and this issue is becoming urgent as the package ecosystem grows. SR-11338 is a real-world example of such an issue, where a repository that was moved ended up in the dependency graph twice (under different URLs), causing SwiftPM to fail.

As the ecosystem grows, and given that moving and renaming repositories is fairly common, this kind of issue will also become common. As such, the solution we choose must embrace and design for this reality.

The package identifier scheme we choose will also become the foundation for resolving module name conflicts across packages, which is another problem becoming urgent as the package ecosystem grows. The canonical example for this issue is two separate packages that both vend a "Utilities" module.

Today, Swift cannot deal with such module name duplication and the user must choose between the packages. If we add compiler support, we could support having two modules named “Utilities” from different packages by prefixing the module name with the package identifier.

The core team has expressed a desire to solve this long-standing issue by adding such compiler support, making it possible for SwiftPM to prefix the modules it generates. Since the module names need to be unique, the package identifier is a good candidate for such a prefix, if we can make sure it is unique.
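
Purely as an illustration of the idea, and not a description of any planned compiler feature, the sketch below derives a prefixed module name from a package identifier. The mangling rule (non-alphanumeric characters become underscores) and the identifiers com.example.alpha and com.example.beta are assumptions made up for this example.

```python
def qualified_module_name(package_id, module):
    """Hypothetical mangling: prefix the module name with a sanitized package identifier."""
    prefix = "".join(ch if ch.isalnum() else "_" for ch in package_id)
    return prefix + "_" + module

# Two packages that both vend a "Utilities" module no longer clash.
alpha = qualified_module_name("com.example.alpha", "Utilities")
beta = qualified_module_name("com.example.beta", "Utilities")
print(alpha)  # com_example_alpha_Utilities
print(beta)   # com_example_beta_Utilities

# With a location-based identifier, the prefix changes when the repository moves.
before = qualified_module_name("example.com/old-home/alpha", "Utilities")
after = qualified_module_name("example.com/new-home/alpha", "Utilities")
print(before == after)  # False
```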

Here too we must embrace the reality that moving and renaming repositories is fairly common, and choose an identity scheme that would make for good module names over an extended period of time.

Some package systems have no need to deduplicate packages because they only support languages in which code duplication is not a problem. However, that is not true of SwiftPM. SwiftPM cannot change the ABI of C, C++, or Objective-C to allow multiple copies of a package containing such code to be loaded into a program at once. Even in pure Swift packages, where such ABI changes are possible, package duplication can cause spurious build or runtime failures in programs that rely on the uniqueness of shared types and state. Therefore, reliable deduplication of packages is a basic requirement for SwiftPM.

As such, the core team asked in its SE-0292 review feedback that "This topic needs to be further explored in a dedicated forum thread in preparation to the next revision." This is that thread.

This thread gathered a lot of attention, and has provided insight into the practical advantages and disadvantages of the two approaches. At this point, it seems like the technical arguments have all been laid out. While there are still different opinions, there is little new information coming from the ongoing discussion.

Since we have seen that both solutions could be made to work, and both carry a set of unique challenges, the remaining question is which one makes better tradeoffs?

The implication of location-based identifiers is that we would be asking SwiftPM users to manage URL mapping files, or set up proxies and other infrastructure, to resolve dependency deduplication issues. In our opinion, this is not practical at a significant scale and time horizon, and will cause continuous pain down the road.

In other words, location-based identifiers land the complexity on the end user, while opaque identifiers land the complexity on the package registries. From where we stand, putting the complexity on the registries would be setting the stage better for a future where we can reliably deduplicate packages and solve the module name clashes in the vast majority of cases.

SE-0292 is critical to the success of the Swift ecosystem. The core team has discussed this topic in our last couple of meetings, and concluded that opaque identifiers are a better fit for the Swift ecosystem for these reasons. While this thread is not a formal review, the core team wanted to share its position that SE-0292 needs to be amended to adopt opaque identifiers before it can be sent back for a second review.

I disagree with this characterization and assessment.

Because Swift packages can be forked, mirrored, and duplicated in the wild, automatic migration isn't always possible. The user must be able to intervene. Our proposal acknowledges this fact and provides mechanisms for users to resolve any conflicts directly.

Pushing this complexity to the registry doesn't solve any problems, but instead adds a level of indirection that will make it harder for users to resolve their issues. A cursory look at other systems, like Maven, reveals that central registries and opaque identifiers face the same problems.

Nor do opaque identifiers guarantee stable identity over time, as seen in Google's Best Practices for Java Libraries guide (e.g. junit.framework (versions 1.x-3.x) → org.junit (version 4)), as well as examples like the left-pad incident and the renaming of libupskirt.

To the contrary, I think this thread has led to important new insights that haven't been addressed by the Core Team.

As @mmarston points out, this discussion has been framed as "URIs vs. opaque identifiers", but it's actually a discussion about whether or not to move from a decentralized system to a centralized registry. None of the feedback I've received from the Core Team so far acknowledges this change, or the necessity of a registry of record for opaque identifiers to function.

Our discussion has also led to some good solutions to address challenges of package relocation, like this idea from @hisekaldma:


Our team has been working on this proposal for over a year now, and we believe that it's the best solution available to the long-term health and prosperity of the Swift ecosystem. We've strived to be transparent and responsive to feedback from the community throughout the process, so it was both surprising and disappointing to get feedback about such a fundamental aspect of our proposal only after formal review, some 7 months after our original pitch and 3 months after submission.

If the Core Team insists on migrating our decentralized ecosystem to use opaque identifiers in a central registry, please understand that your decision is made over our strong technical objections. Nonetheless, we agree that SE-0292 is critical to the success of the Swift ecosystem, and will work with you to find the best solution that meets your requirements.

Does the Core Team / Apple accept that opaque identifiers require a strong, centralized authority to manage ownership? Do they further accept that Apple will almost assuredly have to be the one to host or otherwise pay for such a system? And do they accept that it can't be run like the App Store and that there can be no review?

I'm not entirely convinced that it does. We've been using the term "opaque identifier" here, but there is no reason we need to be limited to truly opaque strings like "foo" and "bar". If we add some structure to it, there might be opportunities to integrate with other systems to establish ownership.

If we recognised reverse-DNS package names as being special, we could then leverage the well-defined concepts of domains and subdomains. Perhaps we could require reverse-DNS package identities to be authenticated by some token (say, the registry hostname) signed with a certificate we can independently verify with the domain owner. So package org.swift.nio would have some extra verification with swift.org done at the client side.
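
One wrinkle is that the identifier alone doesn't say where the domain ends and the package name begins. A minimal sketch, assuming the client simply tries the longest candidate first (the function name and the lookup order are my own invention, not part of any proposal):

```python
def candidate_domains(package_id):
    """List the domains a reverse-DNS identifier could belong to, longest first.

    The bare top-level label (e.g. "org") is never offered as a candidate.
    """
    labels = package_id.split(".")
    return [
        ".".join(reversed(labels[:count]))
        for count in range(len(labels), 1, -1)
    ]

print(candidate_domains("org.swift.nio"))  # ['nio.swift.org', 'swift.org']
```

The client would then ask each candidate in turn for whatever certificate or token the scheme settles on, which is the part this sketch leaves out.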

The first link, Dealing with "Xerces hell" in Java/Maven?, shows an example where a package got published with multiple identities, but from the history it appears that the situation arose because the package author didn't publish to the central repository, so multiple other parties did so, with different names.

The second, Maven dependency resolution (conflicted), is a version conflict issue.

Yes, multiple publishers could publish the same source code under different package identifiers. And yes, opaque identifiers do nothing to address version conflicts. And yes, a package author may have reasons to change the opaque package identifier under which they publish new versions. But none of these are examples where the package identifier changed because the maintainer switched to a different source code hosting provider. That is because, in those ecosystems, changing the URL where the source is located doesn't require changing the package identifier.

I really fail to understand why some people think this is needed.

If I want to create a unique opaque identifier, I can prefix whatever name I want with my domain name. The domain name system is a federated authority system.

When I said that earlier, @mattt answered that those are not routable. But in what way does being routable help make the identity unique?


But what kind of unique is this all about? If I make a fork of something, it'll need to keep the same identity if I want to use it as a drop-in replacement. This is especially true if this identity gets attached to the module name at some point, becoming part of the namespace for the package's content.

There's of course a second kind of identity: the identity of the entity from whom you want to get the package. I think it's perfectly fine to say I want the com.apple.NIO package from github.com/michelf, which would refer to my own fork of Apple's NIO package.

Now two identities are intertwined: the one from the package (prefixed with "com.apple" because it originates from Apple) and the one from the maintainer (who made a fork). I think it's fine if the latter is a URL; it's where the maintainer keeps their packages (it's a small registry of sorts).

It'd be nice if you could tell SPM "I trust these four maintainers, and I want these ten packages" and it'd just go fetch those maintainers' versions of the packages.

Thanks for clarifying that. My broader point was that opaque identities don’t by themselves eliminate the potential for conflicts that require user intervention. You are absolutely correct that these aren’t the result of package sources moving.

Suppose you've registered the domain name michelf.me and suppose SwiftPM package identifiers use reverse-DNS. What is to stop someone else from publishing a me.michelf.malevolent package?

I think where you're going is that there doesn't have to be a central registry. There could for example be 3 equally popular SwiftPM registries, and they could all have a similar requirement that before publishing a package to the registry the publisher must demonstrate they have control of the domain name, similar to the process in Maven Central.

Is that what you have in mind?

It doesn't matter if someone does. It's just an identifier. The question is do you trust the source who is giving you the package... more on this below.

I think what I have in mind is being able to say in a configuration file which package sources I trust:

  • I trust source gh.com/michelf for *
    (my own "domain", I trust everything there and it takes priority because it's listed first)

  • I trust source somefriendlywebsite.org/code for codes.vapor.fluent
    (my trusted friend's fork of fluent, hosted on their website)

  • I trust source gh.com/apple for packages com.apple.*
    (provide me with packages that originated from Apple, but I don't want to get any fork they might have made of someone else's packages)

  • I trust source gh.com/vapor for packages codes.vapor.*
    (provide me with packages that originated from Vapor, but I don't want to get any fork they might have made of someone else's packages)

  • I don't trust source gh.com, but please inform me of any missing package it can find so I can decide for myself.

Those rules could be part of a local configuration. A registry could be built based on rules like this too.

The source in the last bullet point is the public registry service for all users of gh.com. All others sources are individual users; in a way each user would have their own "registry" for the packages under their username.

The public registry's job would be to figure out a policy of which users are trusted/preferred for which package identifiers. If the policy is that the original author is to be trusted (based on the reverse-DNS name), it could pair user accounts to domains by asking a user to confirm ownership using a .well-known file on that domain's website. For cases where someone doesn't have a domain, it could trust prefixes in its own domain like com.gh.michelf.*. Those are only suggestions: each registry can implement its own trust policy of course.
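
For what it's worth, the trust rules listed above could be modelled roughly like this. It is only a sketch under my own assumptions: the rule format and function name are hypothetical, patterns are matched as shell-style globs, and the last bullet (the untrusted gh.com source that merely reports missing packages) is left out.

```python
from fnmatch import fnmatchcase

# (source, package identifier pattern) pairs, in priority order.
TRUST_RULES = [
    ("gh.com/michelf", "*"),
    ("somefriendlywebsite.org/code", "codes.vapor.fluent"),
    ("gh.com/apple", "com.apple.*"),
    ("gh.com/vapor", "codes.vapor.*"),
]

def trusted_sources(package_id, rules=TRUST_RULES):
    """Return the sources trusted to provide package_id, highest priority first."""
    return [source for source, pattern in rules if fnmatchcase(package_id, pattern)]

print(trusted_sources("com.apple.NIO"))
# ['gh.com/michelf', 'gh.com/apple']
print(trusted_sources("codes.vapor.fluent"))
# ['gh.com/michelf', 'somefriendlywebsite.org/code', 'gh.com/vapor']
print(trusted_sources("org.example.thing"))
# ['gh.com/michelf']
```

A resolver would then ask each trusted source in order whether it actually has the package, and take the first one that does.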

Isn't this, in the words of @tomerd "landing the complexity on the end user"? Is this something that the average Swift developer or iOS developer will want to do? Will they be good at it? (i.e. will they set up these policies well enough to avoid malicious code slipping into their app?)

Is each registry expected to have a full closure of dependencies? In other words, packages in registry A can only depend on other packages in registry A. Then users of registry A will have to configure some other registry to get the dependencies not found in A.

And if each registry implements its own trust policy, what if I need package p from registry A and package q from registry B, and p and q both depend on a package with id r, except that the package named r in A is completely unrelated to the package with the same id in B?

This brings up an important point. I like the idea of this configuration; it is complex but it is also expressing a complex use case. Such configuration should be possible. But what about the "default" configuration? If there are 3 popular Swift registries, does SwiftPM ship with a default configuration that lists a set of "trusted" registries, or do we expect each developer to understand registries and which ones to trust?

Not necessarily. Remember that with that scheme gh.com/michelf is itself a mini-registry for the packages of that user; it likely won't contain much and will depend on packages from elsewhere.

I'd expect most users would use a public registry with all or most of the packages they want. But they can add others if they have reasons to do so.

Only one of the two r packages is fetched and something doesn't compile.

With one-letter package names, this is bound to happen. Package identifiers should be based on reverse DNS or some other uniquing scheme.

That is the best case. Worst would be it does compile fine and the one that was fetched is a malicious fake.

I wasn't suggesting that a single-letter package name was used; r was just a placeholder. I contrived this scenario in response to the statement "each registry can implement its own trust policy of course". If the registries do not all have a similar policy, then different publishers could publish different packages with the same package ID in different registries (for example, if one registry requires a DNS-based check and the other does not).
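
A small sketch of that scenario, with made-up registry contents and publisher names: each registry maps a package ID to whoever published it there, and an ID is flagged when two registries disagree.

```python
# Hypothetical registry contents: package ID -> publisher.
registry_a = {"p": "carol", "r": "alice"}
registry_b = {"q": "dave", "r": "bob"}

def conflicting_ids(*registries):
    """Return the package IDs that different registries attribute to different publishers."""
    first_seen = {}
    conflicts = set()
    for registry in registries:
        for package_id, publisher in registry.items():
            # setdefault records the first publisher seen and returns it on later lookups.
            if first_seen.setdefault(package_id, publisher) != publisher:
                conflicts.add(package_id)
    return sorted(conflicts)

print(conflicting_ids(registry_a, registry_b))  # ['r']
```

A client can detect the disagreement this way, but it still has no basis for deciding which r is the legitimate one.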

This is just "URLs as identifiers" with more steps. If the Core Team doesn't want actually opaque identifiers that have no meaning outside package identity, they need to say so. But given the conclusion of @tomerd's final paragraph comes after listing various issues with domain-based identifiers, turning around and saying a reverse DNS identifier used for DNS purposes is both opaque enough and doesn't fall afoul of those issues seems unlikely to succeed. Barring Core Team clarification of course.

Just like @Karl's suggestion, this is neither opaque, nor does it address the many drawbacks of domain-based identity the Core Team used to draw their conclusion. And as others have stated, your proposal puts the onus entirely on the end user, which was another reason for the Core Team's conclusion.

If the requirement is opaque identifiers which require no end user configuration, don't require packages to link identity to outside identifiers like domains, but still require unique and controlled identity, a root registry seems like the simplest option. It's an extremely familiar design domain with well-known benefits and pitfalls and a shallow learning curve. Trying to build some sort of federated design that can still guarantee unique identities without having to worry about hostile collisions doesn't seem to bring any benefit to the end user or the Swift ecosystem. To my mind, the only reason to build such a system would be if Apple refuses to sponsor the design and hosting of a central repository. Then, and only then, would we need to provide some sort of federated service. But such a service would be all-around inferior to a single, centralized registry.