URLs as Swift Package Identifiers

Max_Desiatov · January 15, 2021, 9:17pm

I'm not debating this one. My point is only about the degree of change. Using URLs as package identifiers means that packages without dependencies will probably not notice it at all, while packages with dependencies may need some (I hope relatively small) changes only if they relied on the fact that the last path component was the ID.

Coming up with a unique opaque identifier would touch all package maintainers, no matter how many dependencies they have.

NeoNacho · January 15, 2021, 9:18pm

That's correct, though an opaque identifier could give us a path to stable imports using a fully-qualified name.

Jon_Shier · January 15, 2021, 9:19pm

I mean, you could do the same thing from a URL. Personally, I'd rather not have to disambiguate modules at all, but that doesn't seem to be an option.

johannesweiss · January 15, 2021, 9:19pm

That's great and very useful for both options that are on the table right now, opaque & URI identifiers.

If we actually always resolved at the URL given, then there would be a web of trust and I would agree. The fact is that we don't always do that. If you see https://github.com/apple/swift-nio.git you may think: I already trust GitHub, I trust Apple, and I trust the SwiftNIO community & team: I'm good here. But in fact you may have a registry proxy configured that pulls this package from https://evil.corp/bad-packages/swift-nio.git.

So I personally don't want to conflate opaque IDs and reverse DNS IDs. But for your argument this doesn't matter: Anybody can register $OPAQUE_STRING, that's correct but they can only register it once at a given registry, then it's taken.

If we take Rust or Haskell as examples, dependencies are usually just short strings like tokio and yes, there can only be one tokio. If you feel a little more enterprisy you can require these opaque names to be reverse DNS and you'd have to maybe prove that you own that domain indeed. But I think that's a discussion to be settled after we figured out if opaque or URI-based is the right way to go.

So for example in Rust, to depend on tokio, you'd add tokio = "1.0.2" to your Cargo.toml where tokio is also just an opaque ID. And I think it sets the expectations about what you need to trust correctly: Very clearly with just the string tokio, there must be some registry involved that I will have to trust. In the URI-based case, the URI may look safe but be malicious and that's bad.

NeoNacho · January 15, 2021, 9:20pm

I believe this concern is addressed by this part of Yim's post:

So we would not require everyone to switch to the new model, but have a path to "upgrade" existing packages to using the registry transparently.

Max_Desiatov · January 15, 2021, 9:24pm

If you trust everyone on that list, why wouldn't you trust the configured proxy? Or, to paraphrase, if your point is that a proxy can be misconfigured, why isn't this applicable to DNS or certificate settings? If you're particularly secure you'd use certificate pinning for verifying that GitHub and Apple are who they claim they are. What prevents you from verifying the proxy to the same degree then?

Jon_Shier · January 15, 2021, 9:27pm

I don't think this distinction is nearly as clear as you make it out here. In fact, neither should be trusted at first glance. Both require you to trust your network setup and whatever registry you're using, as well as any client settings. At least with a URL you know there's more at play.

johannesweiss · January 15, 2021, 9:34pm

The argument for the URI was "I know who I'm getting this software from based on just this one string" (which isn't true). The opaque IDs don't make the same claim. I think not making a claim is better than making a false claim.

And certificate pinning and all that jazz doesn't help (quite the opposite): The reason that even the URI-based system allows you to then actually go fetch it totally elsewhere is that this is required in the real world. If nobody had to ever rewrite where the source actually comes from, I don't think we would even be having this discussion. Anybody who wants to have a reliable build system will need to mirror all packages and re-configure SwiftPM where it gets them from. Or else you'll always need the internet & GitHub to work for you to build.

johannesweiss · January 15, 2021, 9:42pm

Are you saying that just the string tokio doesn't tell us that there's "the internet" involved, but seeing https://github.com/tokio-rs/tokio at least tells us "WARNING: Internet is happening", which then tells the developer that they need to audit the configuration? If yes: I can see your point.
But I still doubt that having a bunch of potentially wrong information (namely https://github.com/) in there helps. I'm totally happy requiring users to type package-registry://swift-nio if that conveys the "stuff involving configuration & internet" information.

Max_Desiatov · January 15, 2021, 9:45pm

Why can't the same argument about trust be applied here? Why would I trust a random registry or a mirror more than a registry proxy? The URL at least makes it explicit where the package is coming from, and I can make my own choices whether I trust it or not, and configure a proxy of my choosing based on my security policy. Removing information that describes the source of the package from the package ID will only obscure things.

Basically, I don't understand why there is an assumption here that reducing a package ID to just a shorter identifier will somehow make the source supplying it more trustworthy.

Exactly my point here.

johannesweiss · January 15, 2021, 9:58pm

Exactly! You should audit them in the very same way. So do we agree that by just looking at the strings (swift-nio vs https://github.com/apple/swift-nio.git) we learned exactly as much about whether we should trust it? What I mean by that is that by just looking at the string, we do not know where the code will come from; we also have to audit the configuration.

How? Only the URL + all the configured registry proxies (note this is not your system proxy so it won't check the TLS certificate of the host of the package URL or anything) tell you where it's from.

The very same thing applies with opaque IDs. We need the opaque ID + the registry to know where it's actually from.

You claim it is the source of the package. But it may not be, that's the point. If it always were the source then yes, this would help.

Hence, Mattt is proposing the really useful swift package discover tool which can tell you the true provenance (at the time of running the tool, which isn't bullet-proof: TOCTOU):

This tool would then tell you -- for both URI-based as well as opaque ID based -- the true provenance of the code.

Awesome. I'd argue we can find a better way to suggest "WARNING: Internet happening"? Why not package-registry://swift-nio or online-package-registry://swift-nio or something like that?

Jon_Shier · January 15, 2021, 10:02pm

No, you learn more from the URL. At the very least you've learned where initial requests for the package will go, barring your local configuration and any redirect from the registry. If I know I haven't configured any local proxies (just as I know every day that I haven't done so for git), I know I can trust it unless SPM says it's been redirected somewhere.

Max_Desiatov · January 15, 2021, 10:04pm

Of course not exactly the same, the first one omits crucial information, while the second one doesn't.

What's the benefit of using a custom URL instead of the standardized URLs we already use, which contain much more information in a form that everyone is used to? https:// is already better than online-package-registry:// at indicating that internet is happening, while the added github.com hostname means I should check my DNS settings (if I want to), and then the rest of the path tells me who maintains the package (at least in the case of GitHub/GitLab).

johannesweiss · January 15, 2021, 10:30pm

Wait, isn't that bad if the initial request goes to the correct source and later on it may no longer? I've been bitten by this before actually with the current mirrors feature where I didn't see that a .swiftpm/config was checked in that got packages from elsewhere.

I don't know what to reply to these two comments. Both imply that if I clone & build a repository that contains the dependency string "https://github.com/apple/swift-nio", SwiftPM will actually download https://github.com/apple/swift-nio. That is not the case: there may be configuration that makes SwiftPM actually download https://evil.corp/bad/swift-nio.

I don't get the benefit of adding information that may potentially be wrong?

One crucial thing we should probably discuss is where exactly the configuration lives. In the current SwiftPM mirrors feature, the mirrors can be configured (at least) through:

the environment
a .swiftpm/config file in the repository

As an example I've built a repo whose Package.swift says https://github.com/apple/swift-nio yet when you build it, it'll pull https://github.com/weissi/swift-undefined instead. That works today and I'd think that a similar thing would work in the future with the package registry proxies. Except that it's worse because today it's a textual mapping and in the future it goes through a blackbox on another computer. What the proxy could do is to only rewrite the URLs if the request comes from the build server, then the developers trying to audit on their own machines will see everything as intended.
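
A minimal sketch of the textual mapping described above, to make the point concrete that the declared URL alone does not determine what gets fetched. `MirrorConfig` and `effectiveSource(for:)` are hypothetical names used for illustration only; this is not SwiftPM's actual implementation or file format.

```swift
// Hypothetical model of a textual mirror table; not SwiftPM's real code.
struct MirrorConfig {
    // Maps the URL declared in Package.swift to the URL actually fetched.
    var mirrors: [String: String] = [:]

    // Returns the effective source and whether it differs from the declared URL.
    func effectiveSource(for declaredURL: String) -> (url: String, redirected: Bool) {
        if let mirror = mirrors[declaredURL] {
            return (mirror, mirror != declaredURL)
        }
        return (declaredURL, false)
    }
}

// The mapping from the example repo mentioned above.
let config = MirrorConfig(mirrors: [
    "https://github.com/apple/swift-nio": "https://github.com/weissi/swift-undefined"
])

let resolved = config.effectiveSource(for: "https://github.com/apple/swift-nio")
print("fetching from \(resolved.url), redirected: \(resolved.redirected)")
```

Auditing the manifest string without also auditing `config` says nothing about `resolved.url`, which is the crux of the disagreement in this thread.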

Jon_Shier · January 15, 2021, 10:39pm

... No? That SPM's features work correctly isn't good or bad, it's just how the thing works. That SPM doesn't tell you every time a package is redirected is a failure of the SPM UX and not something that package identifiers can solve. In the question of URL vs. string, URL at least contains some useful data where string contains none.

NeoNacho · January 15, 2021, 11:04pm

I think the tricky part in a registry world is that there's no observable client side redirect anymore, you ask for "https://github.com/apple/swift-nio" or "org.swift.swift-nio" and the registry simply delivers some source code. This isn't a huge departure if the registry is identical with the SCM host, but if it isn't, the difference between using a URL or some other string as the identifier seems marginal to me w.r.t. trust.

This is not an argument against using URLs, but at least to me, this means the ultimate trust model is the same no matter if the identifier is opaque or not.

Jon_Shier · January 15, 2021, 11:27pm

I do think it's an advantage for URLs, even if rather small, to be able to see where the package is supposed to come from, even if SPM reports it was redirected somewhere else. I can immediately copy / paste the URL and visit the intended source to see if that redirect was intentional. It also makes it easier to verify that that's where traffic actually went using tools like Little Snitch. Otherwise, to verify the sources of your packages, we have to have Mattt's suggested discover command, as there'd be no way to even get that initial information.

tachyonics · January 16, 2021, 1:02am

I think there are a couple of points here-

The URL scheme proposed seems to be using the domain to actually indicate where the dependency is homed - that is where the original/canonical version of the dependency is - as opposed to necessarily the location where the dependency is retrieved from
In a distributed package ecosystem world - assuming we want the eco-system to remain as such - this information does seem important; both from an identity perspective - "this is the OctoCorp/linkedlist dependency homed at github.com" (which is a unique identity) as opposed to some other OctoCorp/linkedlist dependency homed elsewhere; and also from a discovery perspective - regardless of mirrors and anything else I can go to github.com for support.
Overall though I agree with @johannesweiss that using the URL scheme as proposed confuses the distinct concepts of retrieval location and home registry in a way that is probably best avoided.

I think there are potential options here that would be clearer-

.package(identifier: "apple/swift-nio", homed: "github.com", from: "2.0.0")

This could also potentially solve the migration problem as SPM should have enough information with something of this or a similar form to determine that it is expressing the same dependency as-

.package(url: "https://github.com/apple/swift-nio[.git]", from: "2.0.0")
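
A rough sketch of how a tool might decide that the two forms above express the same dependency. `PackageIdentity` and `identity(fromURL:)` are hypothetical names, and the normalization rules (drop the scheme, lowercase the host, strip a trailing "/" or ".git") are assumptions made for illustration, not anything SwiftPM is known to do.

```swift
import Foundation

// Hypothetical identity: the home domain namespaces the identifier.
struct PackageIdentity: Hashable {
    let home: String        // e.g. "github.com"
    let identifier: String  // e.g. "apple/swift-nio"
}

// Derives an identity from a repository URL. Returns nil if the string
// has no scheme or no path after the host.
func identity(fromURL url: String) -> PackageIdentity? {
    guard let schemeEnd = url.range(of: "://") else { return nil }
    var rest = String(url[schemeEnd.upperBound...])
    if rest.hasSuffix("/") { rest.removeLast() }
    if rest.hasSuffix(".git") { rest.removeLast(4) }
    guard let slash = rest.firstIndex(of: "/") else { return nil }
    let host = rest[..<slash].lowercased()
    let path = String(rest[rest.index(after: slash)...])
    guard !host.isEmpty, !path.isEmpty else { return nil }
    return PackageIdentity(home: host, identifier: path)
}

// Both spellings of the URL form map to the identifier/homed form.
let withSuffix = identity(fromURL: "https://github.com/apple/swift-nio.git")
let withoutSuffix = identity(fromURL: "https://github.com/apple/swift-nio")
let explicit = PackageIdentity(home: "github.com", identifier: "apple/swift-nio")
print(withSuffix == explicit, withoutSuffix == explicit)
```

The path is deliberately left case-sensitive here; whether "Apple/Swift-NIO" should count as the same package depends on the host and is a separate question.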

johannesweiss · January 16, 2021, 12:25pm

You phrased this way better than I did. I fully agree.

Yes, this works fine for me (apple/swift-nio would be the identifier to identify a package, even if it moves elsewhere apple/swift-nio cannot change). It also happens to (probably) solve the actually hard problem with URIs: We need to (and not all other languages do [*]) absolutely have only one version per package in our build.

Example: A very popular package starts being https://github.com/person/swifty-awesome and later moves to https://gitlab.com/swifty-awesome/swifty-awesome, then we must be able to figure out that https://github.com/person/swifty-awesome == https://gitlab.com/swifty-awesome/swifty-awesome or else we will break builds. In corp environments where everything is mirrored, each package would have even more identities (such as https://my.corp/mirrors/swifty-awesome).

It's unreasonable to assume that everybody [who owns a package in the community] immediately updates their dependencies on swifty-awesome from https://github.com/person/swifty-awesome to the new "canonical" https://gitlab.com/swifty-awesome/swifty-awesome and therefore there will be package graphs that contain both.

And we just can't pull in both and pretend they're two separate packages. We could add compiler support to add the package URL to the Swift name mangling and then the GitHub and the GitLab versions would actually not clash for Swift modules. The problem is though that SwiftPM can also vend C modules and we cannot namespace C modules because we can't just change the C ABI, and also C doesn't have name mangling (similar problems with ObjC, C++, Assembly, and other languages that we can't just change at will through swift-evolution). This is a problem unique to package managers that can compile more than one language.

So I think it's clear that we do definitely need to deduplicate package names. For opaque IDs this is pretty simple (same ID -> same package; different ID -> different package), for URIs this is only simple if we decide to not support changing the origin of a package or we come up with some other way of deduping (Go seems to do this primarily through vanity URLs). If we went with vanity URLs, then the best practise should be to start fresh projects by registering a vanity URL (so you can move source code host later). If we then have well established "vanity URL services", this feels pretty much like opaque IDs + registry, so any potential benefit of URLs vanishes.

Yes, there are other ways of deduping (e.g. through HTTPS redirects, well-known files, vanity URLs, ...). But if you want to be able to build without internet (often required on corporate build servers), you will still need a huge "identity database" (where you'd record that https://github.com/person/swifty-awesome == https://gitlab.com/swifty-awesome/swifty-awesome, or else you can't build without full internet access and all services being up and running).

* Why do other languages not necessarily require each package in the graph to be a unique version across the whole binary?

Let's say we had a language whose package manager can only compile modules in its primary language. Then we could add a feature to the package manager (& compiler) that allows us to load (in one and the same build) multiple versions of the same package. To avoid symbol clashes, the package manager could embed the URI (or a hash thereof) into the name mangling, so that to the linker the symbols from https://github.com/person/swifty-awesome and https://gitlab.com/swifty-awesome/swifty-awesome would be different symbols. So as long as we don't require anywhere that the types from both swifty-awesomes be the same, this all kinda works fine.

(Un?)fortunately, SwiftPM allows us to build at least Swift, C, C++, ObjC(++), and Assembly. We don't have any control over the ABI and the future development of most of these languages. Therefore we cannot do the same name-mangling trick above. So if swifty-awesome has a C module, then even if we mangled the URI into the Swift symbols, the C symbols would still clash and the build would fail.
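
A toy model of the argument above, just to show where the clash comes from. The `origin::name` scheme is a stand-in for "mangle the URI into the symbol" and is not how the Swift compiler mangles names; `Symbol`, `linkerName`, `clashes`, and the symbol names `Client` and `sa_init` are all made up for illustration.

```swift
enum Language { case swift, c }

struct Symbol {
    let name: String
    let language: Language
    let origin: String  // where the package was declared to come from
}

// Swift symbols get the origin folded in (toy stand-in for mangling);
// C symbols are emitted as-is because the C ABI cannot be changed.
func linkerName(_ symbol: Symbol) -> String {
    switch symbol.language {
    case .swift: return "\(symbol.origin)::\(symbol.name)"
    case .c: return symbol.name
    }
}

// Returns the linker-level names that are defined more than once.
func clashes(_ symbols: [Symbol]) -> [String] {
    var seen = Set<String>()
    var duplicates = Set<String>()
    for symbol in symbols {
        let name = linkerName(symbol)
        if !seen.insert(name).inserted { duplicates.insert(name) }
    }
    return duplicates.sorted()
}

let github = "https://github.com/person/swifty-awesome"
let gitlab = "https://gitlab.com/swifty-awesome/swifty-awesome"
let graph = [
    Symbol(name: "Client", language: .swift, origin: github),
    Symbol(name: "Client", language: .swift, origin: gitlab),
    Symbol(name: "sa_init", language: .c, origin: github),
    Symbol(name: "sa_init", language: .c, origin: gitlab),
]
print(clashes(graph))
```

In this model the two Swift `Client` symbols stay distinct, while the C symbol is reported as a duplicate, which is the failure described above.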

I believe in Go this is a non-issue because they can name-mangle in the provenance of everything because they only build Go. That makes it feasible (does have problems too but irrelevant here) to just name-mangle in the provenance URI and have two versions of the same package in the same binary.

tachyonics · January 17, 2021, 5:43am

I hadn't thought about this use case but it is definitely something to be considered.

My thinking when writing my post above was that package identity would include the home repository, as it guarantees unique identities without a central identity service (the home repository domain is essentially providing an identity namespace for the packages that it homes), and that identity wouldn't change regardless of where the package is hosted-

Package(identity: "the apple/swift-nio package homed at gitHub.com", hosted: "gitHub.com") has the same identity but a different location to Package(identity: "the apple/swift-nio package homed at gitHub.com", hosted: "myMirror.com")

It doesn't solve the above use case though. My initial thought would be for the registry to be able to report alias identities for a package - "the amzn/smoke-framework package homed at gitHub.com" is equivalent to "the amzn/smoke-framework package homed at gitLab.com" or even "the aws/smoke-framework package homed at gitHub.com" - which a corporate mirror repository would be able to provide to a build server without the internet, and a well-behaved registry would not reuse identities even if the package is moved elsewhere.
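
One way the alias idea could look, as a sketch only: a small table that a registry or corporate mirror might ship to an offline build server so that several identities collapse to one. `AliasTable`, `declare(_:sameAs:)`, and `resolve(_:)` are hypothetical names, and identities are plain strings here for brevity.

```swift
// Hypothetical alias table: every known identity points at one canonical identity.
struct AliasTable {
    var canonical: [String: String] = [:]

    // Returns the canonical identity, or the input itself if it has no aliases.
    func resolve(_ identity: String) -> String {
        canonical[identity] ?? identity
    }

    // Records that `alias` and `target` name the same package.
    mutating func declare(_ alias: String, sameAs target: String) {
        let newRoot = resolve(target)
        let oldRoot = resolve(alias)
        // Re-point everything that used to resolve to the alias's old root.
        for (key, value) in canonical where value == oldRoot {
            canonical[key] = newRoot
        }
        canonical[oldRoot] = newRoot
        canonical[newRoot] = newRoot
    }
}

var aliases = AliasTable()
aliases.declare("github.com/amzn/smoke-framework", sameAs: "gitlab.com/amzn/smoke-framework")
aliases.declare("github.com/aws/smoke-framework", sameAs: "github.com/amzn/smoke-framework")

// All three spellings should collapse to a single node in the package graph.
let resolvedIdentities = Set([
    "github.com/amzn/smoke-framework",
    "gitlab.com/amzn/smoke-framework",
    "github.com/aws/smoke-framework",
].map(aliases.resolve))
print(resolvedIdentities.count)
```

This only works if whoever publishes the table is trusted, which leads back to the same trust question as the registry itself.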

Something like that does provide some robustness around package identity uniqueness. The alternatives seem to be a central identity service or deciding that guaranteeing unique identities for packages isn't actually providing us anything because you have to trust your build system to get the right package anyway.

As another point, if you look at the example of Soto, their entire name changed. Even for opaque IDs, would it be reasonable to expect a project to keep the same ID for its entire lifetime?