SE-0292: Package Registry Service

The review of SE-0292, "Package Registry Service", begins now and runs through December 17, 2020.

Reviews are an important part of the Swift evolution process. All review feedback should be posted either on this forum thread or, if you would like to keep your feedback private, sent directly to the review manager by direct message in the Swift forums.

What goes into a review of a proposal?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift.

When reviewing a proposal, here are some questions to consider:

  • What is your evaluation of the proposal?
  • Is the problem being addressed significant enough to warrant a change to Swift?
  • Does this proposal fit well with the feel and direction of Swift?
  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Thank you for helping improve the Swift programming language and ecosystem.

Tom Doron
Review Manager

17 Likes

Looks good to me.

Initially, Swift Package Manager will use a package registry to resolve dependencies only when the user passes the --enable-package-registries command-line flag. This may change in a future release.

Could you elaborate on the reasoning here? Is it just general caution/staging or is enabling package registries expected to break builds or have performance/security issues?

// :white_check_mark: These dependencies qualify for resolution with package registry
.package(url: "https://github.com/mona/LinkedList", from: "1.1.0")
.package(url: "https://github.com/mona/LinkedList", .exact("1.1.0"))
.package(url: "https://github.com/mona/LinkedList", .upToNextMajor(from: "1.1.0"))
.package(url: "https://github.com/mona/LinkedList", .upToNextMinor(from: "1.1.0"))

Does this reflect a hope that GitHub will become a package registry in the future, or is it just an example URL that I should mentally substitute for some future official/unofficial Swift package registry?

1 Like

More the former. Downloading and resolving dependencies is one of the primary functions of Swift Package Manager, so it's reasonable to make changes to this behavior opt-in through a feature flag. It lets us address any unforeseen compatibility or performance issues from early adopters without impacting the stability for everyone else.

GitHub announced their intent to support Swift packages in GitHub Packages (which was previously called GitHub Package Registry). If adopted, this proposal would define a standard registry interface that GitHub could implement to add support for Swift packages.

That's actually another reason for putting this behind a feature flag: Most users won't benefit from registry support until GitHub adds support for Swift packages, but that support is predicated on a standard registry interface. :chicken::egg:

7 Likes

Thank you all for putting this proposal together. The AWS CodeArtifact team is excited to see progress toward a Swift package registry, and we have some thoughts on the approach.

Package publishing

Taking inspiration from current best-practices like continuous integration (CI) and continuous delivery (CD), this proposal instead follows what we describe as a "pull" model. When a package owner releases a new version of their software, their sole responsibility is to notify the package registry. The server does all the work of downloading the source code and packaging it up for distribution.

But the deciding factor was that we saw publish as unnecessary; we imagine package publication to be the final outcome of a successful CI/CD pipeline to be run automatically, rather than a command to be run manually.

There are several important security-related reasons to support push-based publishing:

  1. Users will not always be willing or able to allow network access to their source repository from the service.
  2. Users will not always be willing or able to grant authorization to the service to read from the source repository.
  3. Users cannot easily determine if the service imported the files correctly (e.g. via checksum validation), especially if the service is packaging them into a single archive as part of the process.
  4. Users have no opportunity to sign the package prior to publication.

While pull-based publishing may work reasonably well for publicly-accessible, unauthenticated source code repositories where both the package registry and the end users can consult the original source code if necessary, it does not work well when the source code repository is private.

Additionally, enterprise environments may have a variety of authentication mechanisms for source control repositories which may not be supported by all clients.

Package Identity

We recommend adding an additional identifier alongside the package name, to disambiguate when two vendors wish to vend a package with the same name.

To that end, we suggest that packages in the registry be required to have a hierarchical namespace identifier.

Namespaces serve two primary purposes:

First, they allow users to group packages by origin or purpose, and with greater or less specificity; a hypothetical AWS SDK's base libraries might be published under a parent namespace software.amazon.aws.sdk, with service-specific libraries under child namespaces software.amazon.aws.sdk.s3, software.amazon.aws.sdk.ec2, and so on.

Second, they allow the service to scope permissions in intuitive ways. The registry may grant a user permission to publish any package under software.amazon.aws, which would allow that user to publish to software.amazon.aws.sdk or software.amazon.aws.sdk.s3, but not software.amazon. In this way, the registry can grant permission to publish to different portions of a namespace to different users.

If this namespace concept is not explicitly included in the registry definition, then users will end up needing to bake it into the package name, and permission management would need to be done based on prefix matches, which is more difficult to manage.
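
To make the permission-scoping idea concrete, here is a minimal sketch in Swift. The function name `canPublish(granted:target:)` is hypothetical and not part of the proposal or of any registry API. It assumes namespaces are dot-separated and compares whole components, which avoids the false positives that raw string-prefix matching on package names would produce.

```swift
// Hypothetical helper for illustration; not part of the proposal.
// Returns true when `target` is the granted namespace itself or a descendant
// of it, comparing whole dot-separated components instead of raw characters.
func canPublish(granted: String, target: String) -> Bool {
    let grantedParts = granted.split(separator: ".")
    let targetParts = target.split(separator: ".")
    // An empty grant never authorizes anything.
    guard !grantedParts.isEmpty else { return false }
    // `starts(with:)` is false when the target has fewer components.
    return targetParts.starts(with: grantedParts)
}

// A grant on "software.amazon.aws" covers its children but not its parent.
let coversChild = canPublish(granted: "software.amazon.aws", target: "software.amazon.aws.sdk.s3")
let coversParent = canPublish(granted: "software.amazon.aws", target: "software.amazon")
```

With plain string prefixes, a name such as "software.amazon.awsome" would match the grant "software.amazon.aws"; the component-wise comparison rejects it.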

Dependency resolution

When package registries are enabled, Swift Package Manager will first attempt to use package registry API calls to resolve qualifying dependencies, falling back to Git operations if those API calls fail.

Transparently falling back to a git endpoint when the package registry fails seems like unexpected behavior which could result in inconsistent dependency resolution across builds. As a developer I would prefer to have more specific control over where my dependencies come from. (At a minimum, users should be able to disable this fallback behavior.)
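
As a rough illustration of what an opt-out could look like, the sketch below models the choice of source as a pure function. Every name here (`DependencySource`, `ResolutionError`, `chooseSource`, `allowGitFallback`) is hypothetical; this is not how SwiftPM's resolver is implemented, only a way to show the behavior being requested.

```swift
// Hypothetical types for illustration only; SwiftPM's real resolver differs.
enum DependencySource: Equatable {
    case registry
    case git
}

enum ResolutionError: Error, Equatable {
    case registryUnavailable
}

// `registryLookupSucceeded` stands in for the outcome of the registry API calls.
// When fallback is disabled, a registry failure surfaces as an error instead of
// silently switching to Git.
func chooseSource(registryLookupSucceeded: Bool, allowGitFallback: Bool) throws -> DependencySource {
    if registryLookupSucceeded { return .registry }
    guard allowGitFallback else { throw ResolutionError.registryUnavailable }
    return .git
}
```

With the fallback disabled, a failed registry lookup fails the build visibly, so the same manifest cannot resolve from different places on different runs.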

To that end, we would suggest introducing a different dependency declaration format for packages sourced from a registry, so that the dependency always comes from a predictable place, perhaps something like this:

.package(registry: "registry.swift.org", namespace: "software.amazon.aws.sdk", name: "awssdk", from: "1.2.3")

The registry field lets users control which registry to pull the package from in case more than one registry is defined (presuming the client supports multiple registries). A user may want to consume the LinkedList package from a public Swift registry, but consume WidgetBuilder from their company's private registry.

Referencing the package by namespace and name allows publishers to use a private source control system, move their source code from one place to another, or migrate from one source control system to another, all without breaking users.

Package signing

As noted earlier, signing is an important part of vending packages that consumers can verify and trust. The publishing process needs to support a way for the publisher to include a signature alongside the other file(s) in the package, and the registry needs to vend that signature in a way the client can consume, perhaps alongside the checksum.

The registry itself does not necessarily need to be responsible for vending the public keys required to verify those signatures; for example, the client could rely on end users having a keystore of trusted keys (perhaps with a few known keys built in), the way e.g. apt does it.

API Versioning

We strongly recommend the URL path include an API versioning component as a prefix for the paths listed in the proposal. For example, to download the archive for a package, this client would make a request against this URL:

/v1/{namespace}/{name}/{version}.zip

There are a variety of different ways to version APIs; "v1" is simple but you may also consider a date-based or semver-compatible API versioning scheme.

10 Likes

Here is some additional supporting information in favor of an HTTP-based publishing registry rather than a pull-based Git solution. A few other package registries (Ruby, Rust) that began with Git-based registries have experienced challenges that led them to pursue HTTP instead.

This Rust RFC highlights some of the reasoning.
https://github.com/kornelski/rfcs/blob/http-index/text/0000-sparse-index.md#readme

4 Likes

A formal specification for the package registry interface is provided alongside this proposal. In addition, an OpenAPI (v3) document and a reference implementation are provided for the convenience of developers interested in building their own package registry.

Also I've been able to find the reference registry implementation, but I haven't been able to find the OpenAPI (v3) document. Is there any way we could link it?

Are there formal definitions for the metadata endpoints yet, or will the shape of metadata be at the discretion of the implementer? Is there any way we can include source license information in said endpoint definition?

@StanTwinB Thank you for sharing your feedback! I'll do my best to address each of your points below.

But before we dive in, I wanted to make sure that y'all saw the package registry specification itself (from this PR), which goes into much greater detail about the specifics of how everything works.

You can also refer to previous discussion threads [1] [2] [3] for additional context for some of the decisions we made regarding identity, security, and other concerns.

From the specification:

3.6. Package name resolution

Each external package is uniquely identified by the canonical URL of its source code.

I agree that packages need to be properly namespaced in order to disambiguate between packages with the same name. Our approach uses URLs, which uniquely identify packages and confer a host of additional benefits, including global addressability (you can copy-paste a URL into your browser), security (you can authenticate via existing TLS certificate infrastructure), and naming authority (we can delegate any ownership claims to ICANN and domain registrars).

As discussed in this forum post, we believe that a checksum database / transparent log — in concert with the aforementioned TLS certificate infrastructure — can provide better security guarantees than digital signatures.

I don't necessarily agree with this assessment on a few points:

Any user with source access can verify the build product of the registry by running swift package archive on their project at the corresponding commit reference and comparing the resulting Zip archive.

As discussed above, our security model doesn't require digital signatures. Though, for what it's worth, users can GPG sign the Git commit used to generate the source archive, and that commit ID will be included as a comment in the generated Zip file:

$ git rev-parse HEAD
b7c37c81f164e5dce0f64e3d75c79a48fb1fe00b3

$ swift package archive -o LinkedList-1.2.3.zip
Generated LinkedList-1.2.3.zip

$ zipnote LinkedList-1.2.3.zip | grep "@ (zip file comment below this line)" -A 1 | tail -n 1
b7c37c81f164e5dce0f64e3d75c79a48fb1fe00b3 

$ git verify-commit b7c37c81f164e5dce0f64e3d75c79a48fb1fe00b3
gpg: Signature made Tue Dec 16 00:00:00 2020 PST
gpg:                using RSA key BFAA7114B920808AA4365C203C5C1CF
gpg: Good signature from "Mona Lisa Octocat <mona@noreply.github.com>" [ultimate]

Under these conditions, a registry may not be the most appropriate mechanism for publishing packages. Instead, a better option might be to distribute binary frameworks.

I would also note that the publishing model described by this specification doesn't preclude a registry from publishing packages out-of-band through another process. For example, if GitHub wanted to backfill support to existing packages, they might offer a tool that automatically creates package releases from existing semantic version tags.

Thanks for sharing that. However, I'm not sure how this relates to your earlier point or this proposal specifically.

My understanding of the linked Rust RFC is that it's related to the problem of the crates.io registry using Git as a blockchain (which is similar to the situation CocoaPods faced with its Specs repository until it provided a trunk).

To be clear: this proposal indeed follows the HTTP-based publishing model advocated by this RFC. A client doesn't need to interact with Git in order to publish a new package. Rather, the registry pulls the source code from the repository specified by the client (which may be Git or another VCS) and publishes the source archive.

Are you concerned about the use of Git as a blockchain for a transparency log by the registry? Or are you advocating for the ability for clients to push Zip archives to registries as an alternative to the registry creating them from source?

Thanks for this feedback. I agree that it would be desirable to disable this fallback behavior.

For this proposal, we considered four possible versioning strategies:

  • Path: http://api.example.com/v1
  • Subdomain: http://api.v1.example.com
  • Custom MIME type in Accept header: Accept: application/vnd.example.v1+json
  • Custom request header: Accept-version: v1

Each of these options has strengths and weaknesses, but we ultimately decided that the custom MIME type in Accept header offered the best combination of flexibility and ergonomics. It also has worked well in combination with the content negotiation used to upgrade existing package manifests from repository-based dependencies to registry-based dependencies.
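
For readers who have not worked with media-type versioning, here is a small sketch of how a server might read the version out of such a header. It uses the placeholder vendor prefix `vnd.example` from the list above, and the function name `apiVersion(fromAccept:)` is hypothetical. It is deliberately simplified: it assumes a single media type with no parameters and does an exact, case-sensitive prefix match.

```swift
// Hypothetical helper for illustration. Extracts "v1" from
// "application/vnd.example.v1+json". Simplified: one media type, no
// parameters, case-sensitive matching.
func apiVersion(fromAccept header: String) -> String? {
    let prefix = "application/vnd.example."
    guard header.hasPrefix(prefix) else { return nil }
    let rest = header.dropFirst(prefix.count)  // e.g. "v1+json"
    // The version is everything before the "+suffix" part.
    guard let plus = rest.firstIndex(of: "+") else { return nil }
    let version = rest[..<plus]
    return version.isEmpty ? nil : String(version)
}

let negotiated = apiVersion(fromAccept: "application/vnd.example.v1+json")
```

Because the version travels in a header rather than in the path, the same resource URL can serve several API versions and several representations (for example JSON and Zip).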

Yes, indeed! Please see this section of the specification:

A server MAY include metadata fields in its package release response.
It is RECOMMENDED that package metadata be represented in JSON-LD
according to a structured data standard. For example, this response using the Schema.org SoftwareSourceCode vocabulary:

{
  "@context": ["http://schema.org/"],
  "@type": "SoftwareSourceCode",
  "name": "LinkedList",
  "description": "One thing links to another.",
  "keywords": ["data-structure", "collection"],
  "version": "1.1.1",
  "codeRepository": "https://github.com/mona/LinkedList",
  "license": "https://www.apache.org/licenses/LICENSE-2.0",
  "programmingLanguage": {
    "@type": "ComputerLanguage",
    "name": "Swift",
    "url": "https://swift.org"
  },
  "author": {
      "@type": "Person",
      "@id": "https://example.com/mona",
      "givenName": "Mona",
      "middleName": "Lisa",
      "familyName": "Octocat"
  }
}
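
On the client side, reading the license out of a response like this could look roughly as follows. This is only a sketch: the `PackageMetadata` type and `decodeMetadata` function are hypothetical, they cover a handful of the fields above, and they assume `license` is a plain string as in the example, whereas JSON-LD in general allows other shapes.

```swift
import Foundation

// Hypothetical model covering only a few fields from the example response.
struct PackageMetadata: Decodable {
    let type: String
    let name: String
    let version: String
    let license: String?         // optional: servers MAY omit metadata fields
    let codeRepository: String?

    enum CodingKeys: String, CodingKey {
        case type = "@type"      // JSON-LD keyword, not a valid Swift identifier
        case name, version, license, codeRepository
    }
}

func decodeMetadata(_ data: Data) throws -> PackageMetadata {
    return try JSONDecoder().decode(PackageMetadata.self, from: data)
}

// A trimmed copy of the example response above.
let sampleJSON = """
{
  "@context": ["http://schema.org/"],
  "@type": "SoftwareSourceCode",
  "name": "LinkedList",
  "version": "1.1.1",
  "codeRepository": "https://github.com/mona/LinkedList",
  "license": "https://www.apache.org/licenses/LICENSE-2.0"
}
"""

let metadata = try? decodeMetadata(Data(sampleJSON.utf8))
```
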
4 Likes

Strong +1.

Yes. This proposal potentially unlocks performance improvements in SwiftPM dependency resolution, and every improvement on that front is very welcome, especially for big projects.

Yes, it does. It's a developer experience improvement, which should definitely be a direction for improvements in Swift and SwiftPM.

It compares really well. This is on par with what Rust developers get with crates.io, JavaScript developers with npm, and Python developers with PyPI.

I read the proposal and I'm following implementation PRs in the SwiftPM GitHub repository.

3 Likes

Are there any implementation details for how SwiftPM's dependency resolution is going to change? My concern is that whilst this sounds great for pinned versions or small projects, very large projects may suffer. For instance, currently to resolve the dependencies, SwiftPM downloads the single Git repository (which admittedly can be large) and then can quickly change between tags to check transitive dependencies. With the new proposed implementation, it seems that either a new manifest needs to be downloaded for each version of every target (which is a lot of network connections) or the entire package needs to be downloaded for every version.

I also haven't seen anything that addresses private dependencies. Currently you can integrate private packages using git@github.com:0xTim/MyPrivatePacakge.git - this will continue to work, but will it mean that the entire dependency resolution reverts to Git-based resolution, or just the resolution of that package? And has any thought been given to how private dependencies can be managed with the new service?

Finally, how does this fit in with the new system-wide cache of dependencies for SwiftPM? I would guess that SwiftPM will use the cache by default, but there's no mention of it in the proposal.

Overall though, big +1 to this! Doing dependency resolution with a Package.resolved file that contains the versions you already want should be a lot quicker!

2 Likes

You can see the latest working implementation in this PR to apple/swift-package-manager.

You're correct in pointing out that the fixed cost of downloading the entire history of a project by cloning its source repository may, under some circumstances, be less than the total cost of downloading multiple releases through a registry. However, in practice, I believe a registry will be as fast as, if not faster than, a repository for a few reasons:

  • We can see all of the available releases for a package from a single endpoint
  • We define a separate endpoint for downloading Package manifests at individual versions, without downloading the full archive
  • In general, globally-distributed CDNs are faster at serving Zip files over HTTP than repository hosts are at serving code via Git (even if that's Git over HTTP)

Ultimately, these are empirical claims that will have to be verified by real-world usage. I have a benchmark harness that you can use to try it out for yourself. At the moment, the results are promising:

$ bundle exec rake clobber benchmark
time ./spm run
       57.70 real        87.12 user        10.88 sys
time ./spm run --enable-package-registry
       16.24 real        35.85 user         4.75 sys

If a repository-based external dependency itself has external dependencies that can be resolved through a registry, then SPM will attempt to do so.

This came up in a previous discussion thread, but you're right — this isn't mentioned in the proposal or specification.

Alternatively, you can provide credentials through a .netrc file.

These features have been developing in parallel, and the current draft implementation of registry support doesn't support the system-wide cache. That said, I think it'd make a lot of sense to add support eventually.

3 Likes

What is your evaluation of the proposal?

+1. It looks like the natural next step.

Is the problem being addressed significant enough to warrant a change to Swift?

Yes

Does this proposal fit well with the feel and direction of Swift?

Yes

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

Yes, but it was some time ago.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

A quick reading.

I've mentioned it before, and I'm going to mention it again - there simply isn't enough time to give this an adequate review.

A formal specification for the package registry interface is provided alongside this proposal. In addition, an OpenAPI (v3) document and a reference implementation are provided for the convenience of developers interested in building their own package registry.

Where? I don't see them linked in the proposal document or announcement post. I spent a couple of days considering and researching the proposal before I even found that document. I don't have the time to review this proposal + the formal specification in 8 days, together with participating in the multiple ongoing concurrency reviews and proposals soon to be reviewed, plus everything else I have to do in life. I don't know if this kind of crunch and lack of work-life balance is normal at Apple, but it's really not acceptable to force it on the community and I'm going to continue to speak out against it.

It's not only bad for the community (both for its members and as a whole), but I fear this effort to rush proposals through before the end of the year and without proper scrutiny is going to lead to a reduction in quality. I hope the core team will accept responsibility for the consequences.


That being said. Here are a couple of points:

1. Ban credentials in URLs

Firstly, RFC 3986 deprecated the password component 15 years ago:

Use of the format "user:password" in the userinfo field is deprecated. Applications should not render as clear text any data after the first colon (":") character found within a userinfo subcomponent unless the data after the colon is the empty string (indicating no password). Applications may choose to ignore or reject such data when it is received as part of a reference and should reject the storage of such data in unencrypted form. The passing of authentication information in clear text has proven to be a security risk in almost every case where it has been used.

The same RFC warns that credentials present a major spoofing risk:

Because the userinfo subcomponent is rarely used and appears before the host in the authority component, it can be used to construct a URI intended to mislead a human user by appearing to identify one (trusted) naming authority while actually identifying a different authority hidden behind the noise. For example

ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm

might lead a human user to assume that the host is 'cnn.example.com', whereas it is actually '10.0.0.1'. Note that a misleading userinfo subcomponent could be much longer than the example above.

Secondly, major browsers have stopped supporting credentials in URLs:

  • Internet Explorer stopped it back in IE6 (yes, in at least this respect, IE6 was ahead of its time. You do not get to say that very often).
  • Safari stopped supporting them in iOS 11. I can't find any official announcement or release notes (or even which version of Safari shipped in iOS 11), but a web search shows that customers noticed. I just tested it by visiting https://guest:guest@jigsaw.w3.org/HTTP/Basic/ (a W3C test server), and even though the URL contains credentials, Safari seems to drop them and asks you to enter them manually.
  • Chrome is trying their best (status and spec discussions). I think the current state is that they ignore credentials in subresource requests but allow them for top-level navigation, or something like that.

I saw the GitHub examples in the proposal, which was a bit surprising (why are GitHub helping to resurrect this relic?). I eventually found this post from 2012 explaining it.

Now, I'm not a web developer, and I'm not going to pretend to know anything about OAuth and the best practices for using it, but this seems like a pretty irresponsible feature for GitHub to be supporting. Here's what they say in the blog post:

If you’re cloning inside a script and need to avoid the prompts, you can add the token to the clone URL:

git clone https://<token>@github.com/owner/repo.git

or

git clone https://<token>:x-oauth-basic@github.com/owner/repo.git

Note: Tokens should be treated as passwords. Putting the token in the clone URL will result in Git writing it to the .git/config file in plain text. Unfortunately, this happens for HTTP passwords, too. We decided to use the token as the HTTP username to avoid colliding with credential helpers available for OS X, Windows, and Linux.

So they tell you that this feature is designed for scripts, but to watch out because your token will be stored in the git config file in plaintext -- ignoring the fact that the token is already stored as plaintext in the script (or in our case, the package manifest). Doesn't make sense to me.

A quick look at the OAuth spec shows that tokens can be put into URL components (they use the query component as an example), but the spec seems to discourage doing so:

Because of the security weaknesses associated with the URI method (see Section 5), including the high likelihood that the URL containing the access token will be logged, it SHOULD NOT be used unless it is impossible to transport the access token in the "Authorization" request header field or the HTTP request entity-body. Resource servers MAY support this method.

This method is included to document current use; its use is not recommended, due to its security deficiencies (see Section 5) and also because it uses a reserved query parameter name, which is counter to URI namespace best practices, per "Architecture of the World Wide Web, Volume One" [W3C.REC-webarch-20041215].

And here is the Section 5 they refer to:

Don't pass bearer tokens in page URLs: Bearer tokens SHOULD NOT be
passed in page URLs (for example, as query string parameters).
Instead, bearer tokens SHOULD be passed in HTTP message headers or
message bodies for which confidentiality measures are taken.
Browsers, web servers, and other software may not adequately
secure URLs in the browser history, web server logs, and other
data structures. If bearer tokens are passed in page URLs,
attackers might be able to steal them from the history data, logs,
or other unsecured locations.

Given all of this, I would propose that we outright ban credentials in registry URLs, and I'd actually go one step further and ban them from all package URLs.
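
A check along these lines could be sketched as below. The function name `isCredentialFree` is hypothetical and this is not SwiftPM code. It relies on Foundation's `URLComponents`, so it inherits whatever quirks that parser has; treat it as an illustration of the rule rather than a hardened validator.

```swift
import Foundation

// Hypothetical validation helper; not part of SwiftPM.
// Rejects strings that fail to parse, and strings whose authority carries a
// user or password component.
func isCredentialFree(_ urlString: String) -> Bool {
    guard let components = URLComponents(string: urlString) else { return false }
    return components.user == nil && components.password == nil
}

let plain = isCredentialFree("https://github.com/mona/LinkedList")
let withToken = isCredentialFree("https://token@github.com/owner/repo.git")
```

Both token placements from the GitHub blog post (as the username alone, or as the username with an `x-oauth-basic` password) would be rejected by this check.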

2. Underspecified components

The proposal mentions certain conditions for registry URLs, but not all components are specified. Can registry URLs contain query strings?

Note: I had more to say here, but after discovering the "formal specification", my questions have changed. Where does the package name in the GET /{package} actually come from? Is its character set limited? Why do the examples in that document all say things like GET /github.com/mona/HashMap? How does the client know that the package sources are hosted on GitHub? Is github.com part of the package name?

Lots of questions. No time :man_shrugging:. Give us a proper review duration and I'll write them up.

3. My suggestion

I've spent the last ~8 months diving deep into URLs (it's not all I've been doing, but it's one of the things), and the closer you look at them, the worse they look.

There's a lot to say about the deficiencies of URLs, but luckily I can recommend a couple of videos instead (one short, one long).

  1. HOW FRCKN' HARD IS IT TO UNDERSTAND A URL?! - uXSS CVE-2018-6128 (15 mins). Explains a universal cross-site scripting bug that affected WebKit on iOS. I haven't looked into the bug in detail, but either there were different URL parsers in the OS which saw different hosts for the same URL, or there was an idempotency bug (parsing -> serialising -> parsing the URL changed its meaning).

  2. A new era of SSRF (47 mins). A now famous talk by Orange Tsai about how to exploit quirks in different URL parsers. It includes this slide, demonstrating how each of the common parsers used by Python sees a different host from the same URL. It's so beautiful I'm thinking about getting a poster made of it:

FWIW, we have similar issues...

import Foundation
let url = URL(string: "https://test1@test2@test3/")!
print(url.host) // Optional("test2@test3")
print(url.standardized.host) // Optional("test2@test3")

Safari and most other browsers will consider the host to be test3. The older URL spec that Foundation's URL type follows left this case ambiguous. Newer URL standards have tightened their definitions, so Foundation has been left with nonstandard behaviour that could open the door for misunderstandings and exploits. Presumably SwiftPM would use Foundation's URL support and inherit its quirks.

And then you consider lower levels - I've heard that Foundation's networking is built on cURL, but how does cURL parse URLs? Does it always agree with Foundation? Again, see Orange Tsai's talk about the cURL maintainers' approach to this. There are plenty of times when your URL library says the host is x, but then you make the request and it goes out to some other host y.

There have been so many attempts to standardise URLs over the decades, and none of them have really worked. Applications diverged and added special behaviours (or had bugs), which people relied upon, which caused them to spread, which defeated the standardisation effort. The WHATWG has had to reduce its ambitions with the latest spec: it's now a living document (things can change at any time as new quirks are discovered), and one of the main goals now is just to document reality, not to enforce best practices. In fact, this is literally a quote from the latest standard:

The application/x-www-form-urlencoded format is in many ways an aberrant monstrosity, the result of many years of implementation accidents and compromises leading to a set of requirements necessary for interoperability, but in no way representing good design practices. In particular, readers are cautioned to pay close attention to the twisted details involving repeated (and in some cases nested) conversions between character encodings and byte sequences. Unfortunately the format is in widespread use due to the prevalence of HTML forms.

So if we are going to build something better than our current solution - more secure, more robust - we should take another look at whether we even need to use URLs at all. Let's consider the information packed inside a URL, and whether we actually need it:

Component     Needed?
Scheme        :x: - always https
Credentials   :x: - bad idea, officially deprecated
Hostname      :white_check_mark:
Port          :grey_question: - always 443?
Path          :white_check_mark: - maybe? I assume the package name comes from here somehow
Query         :x: - I assume they are not supported
Fragment      :x: - doesn't even get sent to the server

So out of everything URLs can do, and all the quirks they have to support, we basically only need 2 things: the hostname of the registry, and a package name (and a port if we want to support nonstandard ports).

Just ask for that data as 2 parameters, keep it separate, and you'll avoid all of those funky URL issues and their associated security vulnerabilities. There are no lower levels (like cURL) which can misinterpret which part is the host, and since we're now explicitly taking a package name instead of a path component (or wherever else that package name comes from), we can add semantic meaning as required (e.g. case insensitivity):

.package(name: "mona/LinkedList", registry: "GitHub.com", from: "1.1.0")
                     |
                    This is an opaque string, not a 'path'

It means that instead of automagically "upgrading" normal dependencies to use the registry, users would have to change their package manifests, but it would give us a much more robust system than we have now or the one proposed.
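
To show what "keeping the data separate" might buy us, here is a minimal sketch of such a reference as a value type. The type name `RegistryPackageReference` and the normalization policy are assumptions made for illustration: hostnames are compared case-insensitively, and the same is assumed for package names purely as an example of adding semantic meaning.

```swift
// Hypothetical value type; the names and normalization rules are assumptions.
struct RegistryPackageReference: Hashable {
    let registry: String  // hostname only, never a full URL
    let name: String      // opaque package name, not a path

    init(registry: String, name: String) {
        // Assumed policy: both fields are compared case-insensitively,
        // so they are normalized once at construction time.
        self.registry = registry.lowercased()
        self.name = name.lowercased()
    }
}

let a = RegistryPackageReference(registry: "GitHub.com", name: "mona/LinkedList")
let b = RegistryPackageReference(registry: "github.com", name: "mona/linkedlist")
let sameDependency = (a == b)
```

Since there is no URL to parse, there is no userinfo, query, or fragment to mishandle, and equality between two references is defined entirely by the type.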

11 Likes

In general, an enthusiastic +1 for this proposal.

One issue worth considering is how and if registry mirrors will be permitted. Taking China as the obvious example, there are several mirrors set up by corporations or universities, e.g:

Note Rust Crates' mirror deployment instructions, which are used by USTC. Is this something that will be considered for this proposal, with its benefits but adverse implications on security?

This last sentence especially seems open to interpretation. Does it mean completely different URLs might be accepted as equivalent by some configuration?

3.6. Package name resolution
...
Each external package is uniquely identified by the canonical URL of its source code. Therefore, a package is a shared dependency of two packages if and only if both of them declare an external dependency with the same URL.
...
A client MAY use other techniques to determine that two dependencies are equivalent, such as comparing their contents, structure or history.
2 Likes

What is your evaluation of the proposal?

Huge +1 on that! A package registry service is very important as an alternative to git, and I think the general idea and design are great.

Is the problem being addressed significant enough to warrant a change to Swift?

Of course! Supporting a registry is another huge leap for SwiftPM, which will speed up the build process in various conditions.

Does this proposal fit well with the feel and direction of Swift?

I think so. The proposal defines a feasible, fully functional model for SwiftPM to work with an HTTP registry, but I do have some safety concerns. The proposal suggests an endpoint for retrieving Package.swift over HTTPS, which will then be executed on the local machine. How do we protect it from attacks like MitM, which could tamper with the manifest code? Should we require manifests to be encrypted during HTTP transfers?

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

I think it’s certainly a great one! The proposal empowers SwiftPM and fits its model quite well, but security needs extra care, especially when downloading code (including manifests) from the Internet, since Swift will execute it instead of just parsing it. Also, I think the proposal would be better if it covered mirroring a registry, as people in certain network conditions will definitely need this.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I’ve been tracking this proposal since it first appeared on the forum! Very glad to see it fully drafted. I’m pointing out these potential problems from the perspective of a user and a manager of a registry.

1 Like

Thanks for taking the time to review and share your thoughts, @Karl. Responding to your individual points:

I agree that this is a topic for further consideration, but I don't think that this is particularly relevant to this proposal. Both package registries and conventional, repository-based dependencies will share a common authentication mechanism through Swift Package Manager. The only reason I brought up hardcoded credentials in my response to @0xTim was to describe an existing strategy for accessing non-public external packages.

For context, Swift Package Manager only recently added support for .netrc files [1]. Any decision to ban credentials in URLs should take a considered approach, where deprecation warnings and documentation give SPM users an opportunity to migrate to a better solution.

Packages are uniquely identified by a canonicalized form of a URL that locates their source code. The exact behavior of this is described by the CanonicalPackageIdentity, which was introduced in this PR to apple/swift-package-manager on November 18, 2020.

That's not entirely correct. As I said before, and describe in the proposal, packages are identified by URLs. We use URLs as identifiers not just because a URL's components map to the parameters we need, but because they locate a server resource. As I discussed in my response to @StanTwinB:

From the beginning, Swift Package Manager has taken a federated approach to identity. I think that's a good thing. In contrast to other systems, there's no centralized naming authority for packages. You aren't requesting a package named mona/LinkedList that happens to be hosted on GitHub, you're requesting the package github.com/mona/LinkedList.

By using canonicalized URLs to identify packages, we're saying that https://github.com/mona/LinkedList and ssh://git@github.com:mona/LinkedList.git are the same, and we use HTTP content negotiation to upgrade requests to use a faster, more secure registry interface, when available.
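To illustrate the idea, here is a deliberately simplified sketch. The `simplifiedIdentity` function is hypothetical and is not the CanonicalPackageIdentity implementation; it only covers the two URL shapes mentioned above and ignores many cases a real implementation has to handle (IPv6 hosts, percent-encoding, file paths, and so on).

```swift
import Foundation

// Simplified illustration only; NOT SwiftPM's CanonicalPackageIdentity.
func simplifiedIdentity(_ location: String) -> String {
    var s = location.lowercased()

    // 1. Drop the scheme ("https://", "ssh://"), if present.
    if let scheme = s.range(of: "://") {
        s = String(s[scheme.upperBound...])
    }

    // 2. Split the authority from the path at the first "/".
    let slash = s.firstIndex(of: "/") ?? s.endIndex
    var authority = String(s[..<slash])
    var path = String(s[slash...])

    // 3. Drop userinfo: everything up to the last "@" in the authority.
    if let at = authority.lastIndex(of: "@") {
        authority = String(authority[authority.index(after: at)...])
    }

    // 4. A ":" is either a numeric port (dropped here) or an scp-style
    //    separator in front of the first path component.
    if let colon = authority.firstIndex(of: ":") {
        let digits: ClosedRange<Character> = "0"..."9"
        let host = String(authority[..<colon])
        let rest = String(authority[authority.index(after: colon)...])
        if !rest.allSatisfy({ digits.contains($0) }) {
            path = "/" + rest + path
        }
        authority = host
    }

    // 5. Drop trailing slashes and a trailing ".git".
    while path.hasSuffix("/") { path.removeLast() }
    if path.hasSuffix(".git") { path.removeLast(4) }
    return authority + path
}

print(simplifiedIdentity("https://github.com/mona/LinkedList"))       // github.com/mona/linkedlist
print(simplifiedIdentity("ssh://git@github.com:mona/LinkedList.git")) // github.com/mona/linkedlist
```

Both spellings collapse to the same string, which is what lets the dependency graph treat them as one package.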

Getting back to the concerns you raised about edge cases in parsing URLs: I'm having trouble imagining a scenario where this ambiguity could be exploited. If an invalid or ambiguous URL is provided, it will either fail to resolve in an HTTP request or it will cause dependency resolution to fail. Do you have a specific concern about how these URLs could cause problems for the package ecosystem?

2 Likes

@yonihemi @stevapple Thanks for weighing in! Responding to your points:

We imagine two primary mechanisms for mirroring — one on the client, and one on the server.

Client-side, you can use swift package config set-mirror (as described above) to route individual packages to a different endpoint. For example, if your project included github.com/mona/LinkedList as a direct or transitive dependency, you could set a mirror to resolve and download it through another server (perhaps geographically closer or within an internal network).

In the future, we could also add support for Swift Package Manager to set blanket policies on how package URLs are routed, rather than specifying mirrors individually. However, we consider this to be out of scope for this proposal.

Server-side, a registry may use Link headers in their response to designate alternative download locations / mirrors. From the specification:

4.4.2. Download locations

A server MAY specify mirrors or multiple download locations using Link header fields with a duplicate relation, as described by RFC 6249. A client MAY use this information to determine its preferred strategy for downloading.

HTTP/1.1 200 OK
Accept-Ranges: bytes
Cache-Control: public, immutable
Content-Type: application/zip
Content-Disposition: attachment; filename="LinkedList-1.1.1.zip"
Content-Length: 2048
Content-Version: 1
Digest: sha-256=a2ac54cf25fbc1ad0028f03f0aa4b96833b83bb05a14e510892bb27dea4dc812
ETag: e61befdd5056d4b8bafa71c5bbb41d71
Link: <https://mirror-japanwest.example.com/mona-LinkedList-1.1.1.zip>; rel=duplicate; geo=jp; pri=10; type="application/zip"
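For a sense of how a client might consume that header, here is a rough sketch that extracts `rel=duplicate` targets from a Link header value and orders them by `pri`. The `Mirror` type and `duplicateLinks(in:)` function are hypothetical names, the parsing is intentionally naive (it is not a complete RFC 8288 parser), and the ordering assumes my reading of RFC 6249 that lower `pri` values are preferred.

```swift
import Foundation

// Hypothetical types for illustration; not part of SwiftPM or the spec.
struct Mirror: Equatable {
    let url: String
    let priority: Int   // "pri" parameter; assumed lower = preferred
}

// Extracts rel=duplicate targets from a Link header value.
// Naive parsing: assumes targets are wrapped in <...> and parameters
// are separated by ";".
func duplicateLinks(in headerValue: String) -> [Mirror] {
    var mirrors: [Mirror] = []
    let junk = CharacterSet(charactersIn: " ,\"")

    for part in headerValue.split(separator: "<") {
        guard let close = part.firstIndex(of: ">") else { continue }
        let target = String(part[..<close])

        var params: [String: String] = [:]
        for rawParam in part[part.index(after: close)...].split(separator: ";") {
            let pair = rawParam.split(separator: "=", maxSplits: 1)
            guard pair.count == 2 else { continue }
            let key = pair[0].trimmingCharacters(in: junk).lowercased()
            params[key] = pair[1].trimmingCharacters(in: junk)
        }

        guard params["rel"] == "duplicate" else { continue }
        let priority = params["pri"].flatMap { Int($0) } ?? 999_999
        mirrors.append(Mirror(url: target, priority: priority))
    }
    return mirrors.sorted { $0.priority < $1.priority }
}

let header = "<https://mirror-japanwest.example.com/mona-LinkedList-1.1.1.zip>; rel=duplicate; geo=jp; pri=10; type=\"application/zip\""
let mirrors = duplicateLinks(in: header)
print(mirrors.first?.url ?? "no mirrors")
```

A client would still verify the downloaded archive against the checksum regardless of which location it picked.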

Yes. For example, projects on GitHub may be renamed or moved to other organizations, such that github.com/mona/LinkedList and github.com/Octogroup/LinkedList-Swift are equivalent. Over HTTP, this relationship is indicated by a redirect (303 status code). Swift Package Manager could use this information to reconcile nodes in the dependency graph to a single package identity.

Package registries can provide greater security guarantees than the current approach of fetching source code directly from repositories.

The package registry requires all communication to occur over secure HTTPS connections, which goes a long way to mitigating man-in-the-middle attacks. Swift Package Manager currently allows repository URLs to use insecure HTTP URLs, but most code hosts — including GitHub — automatically upgrade HTTP requests to HTTPS, so the potential security impact is limited. We could, as a separate measure, consider generating warnings or errors for external repository-based dependencies specified with an insecure protocol.

We use checksums as an additional mechanism for ensuring that the package we're downloading is authentic, and hasn't been tampered with.

All of that said, and to your point: a secure connection and a verified checksum don't guarantee that the package being downloaded isn't itself nefarious. For that reason, package manifests are evaluated within a sandbox, which limits their effect on the system (this is true currently, and is unaffected by the registry).

1 Like

I think that’s a very important one. At least for the current proposal, we should allow users to set a mirror for the whole registry with a command like this:

$ swift package config set-mirror \
--original-url https://github.com \
--mirror-url https://localhost:8080/github.com/

which would be prefix-matched and substituted.
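Roughly, the substitution I have in mind could work like the sketch below. The `applyMirror` function is hypothetical (set-mirror does not behave this way today), and requiring the match to end on a "/" boundary is my own assumption, so that https://github.com does not also match a host like github.com.evil.example.

```swift
// Hypothetical prefix-matched mirror substitution; not current SwiftPM behavior.
func applyMirror(to url: String, original: String, mirror: String) -> String {
    // Treat "https://github.com" and "https://github.com/" the same.
    func trimmed(_ s: String) -> String {
        var s = s
        while s.hasSuffix("/") { s.removeLast() }
        return s
    }

    let prefix = trimmed(original)
    guard url.hasPrefix(prefix) else { return url }

    // Only match on a path boundary, never in the middle of a hostname.
    let rest = url.dropFirst(prefix.count)
    guard rest.isEmpty || rest.hasPrefix("/") else { return url }

    return trimmed(mirror) + String(rest)
}

print(applyMirror(to: "https://github.com/mona/LinkedList",
                  original: "https://github.com",
                  mirror: "https://localhost:8080/github.com/"))
// https://localhost:8080/github.com/mona/LinkedList
```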

Also, maybe we should work out a basic way for a mirror to sync with the origin. I’m very eager to build such a mirror for Chinese users, because traffic to both GitHub and Swift.org can be terrible here and we definitely need one.

I agree that this is an important issue, and I think we should explore our options for how best to support this use case in the future.

Until we settle on a long-term solution, there are a few different workarounds that don't require any changes to Swift Package Manager:

$ echo "10.0.0.1 github.com" >> /etc/hosts
$ ifconfig lo0 10.0.0.1 alias
$ ipfw add fwd 127.0.0.1,8080 tcp from me to 10.0.0.1 dst-port 443

Sure - I mention it because this proposal already limits which kinds of URLs get the registry treatment. It could also be done later.

Ah, that's interesting - I was wondering about this. I saw some mention of “canonical” URLs in the proposal, but I couldn't find a definition for what that meant and AFAIK it's not a standard thing. So I think that needs to be part of this proposal, and just by itself needs a lot of careful attention.

Is this the {package} part of the GET /{package} request? Because if so, I think it's important for registry implementors to know exactly how that name is generated.

Additionally, I have several issues with the implementation.

  • It tries to parse URL strings by itself, and I cannot overstate how bad of an idea that is.

  • It has similar flaws to Foundation’s implementation (e.g. considering the first "@" to be the userinfo/hostname separator, when it should be the last "@" in the authority component).

  • It keeps the hostname but drops the port. A "host" is a combination of hostname + port, and both are necessary IMO, but that's straying outside of my area so I'll defer to others about whether this is okay.

  • It replaces tildes (~) with the username component from the URL (if it has one), so using the example from the code comments:

    ssh://mona@example.com/~/LinkedList.git → example.com/~mona/LinkedList
    

    I have never seen this behaviour before; is there any precedent for it? Also, is the component after replacement ~mona or just mona? Given that the code comments appear to be the authoritative documentation for this "canonicalisation" transform, it's important that it is accurate.

  • Percent decoding happens as one of the last steps, but it doesn't validate the string after decoding. What if I included percent-encoded spaces or newlines? If that gets inserted as-is into a GET /{package} request, the request then contains unvalidated user input and can be manipulated quite easily.

    Maybe this would get rejected by some other layer of SPM, but I don't think so - these are just opaque strings in an invented format, so this construction stage is the only validation that happens and no URL-parser would get invoked on them (I think). A proper analysis of that would require a larger effort, digging through the SPM codebase.

  • It is platform-dependent. Backslashes will be treated as path separators on Windows, but not on macOS or Linux. If we are sending this to a server and it is expecting to look up a particular canonical package identity string, that string cannot depend on the platform that is requesting the package.

  • "my conceit here is that a file path is actually a file:/// URL with an implicit scheme." No it isn't, and that kind of thinking is the path to buggy and brittle code. URL paths implement an abstract model of a tree of nodes, but they carry basically no meaning beyond "whatever the server wants to do with this string". That's the drawback of being "universal".

    There are all kinds of ways that this shows up in practice. For example, the backslash behaviour you implemented for file URLs is actually a property of file paths (which are necessarily OS and filesystem-specific). On Windows, it is considered to be a path separator. On Linux/macOS, it may be used to escape a space (e.g. /some/path/folder\ with\ spaces/). In URLs with special schemes, backslash and forwardslash are equivalent everywhere (e.g. http:\\www.example.com\some\path is the same as http://www.example.com/some/path).

  • The code comments also state:

    /// Swift Package Manager takes additional steps to canonicalize URLs
    /// to resolve insignificant differences between URLs.
    /// For example,
    /// the URLs `https://example.com/Mona/LinkedList` and `git@example.com:mona/linkedlist`
    /// are equivalent, in that they both resolve to the same source code repository,
    /// despite having different scheme, authority, and path components.
    

    I will need more time to process it, but as it stands I am pretty suspicious of this, just at a conceptual level.

  • I think some additional processing of the hostname may be required for this to work as intended. Hostnames with disallowed characters get processed differently based on their schemes (https: will get transformed by IDNA, ssh: will get percent-encoded), but the canonical identity has no knowledge of the scheme its hostname came from, so equivalent hostnames may be represented differently in this canonical form.
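On the percent-decoding point above, the kind of post-decoding check I mean could look something like this. It is a minimal sketch under my own assumptions: `isSafeForRequestPath` is a hypothetical function, not something in the SPM codebase, and a real validator would likely need a stricter allow-list than "no control characters or whitespace".

```swift
import Foundation

// Hypothetical validation step, run AFTER percent-decoding and before the
// string is placed into a GET /{package} request. Not SPM code.
func isSafeForRequestPath(_ raw: String) -> Bool {
    // Reject strings that cannot be decoded at all.
    guard let decoded = raw.removingPercentEncoding, !decoded.isEmpty else {
        return false
    }
    // Reject anything that could break out of the request line:
    // CR, LF, other control characters, and whitespace.
    let forbidden = CharacterSet.controlCharacters
        .union(.whitespacesAndNewlines)
    return !decoded.unicodeScalars.contains { forbidden.contains($0) }
}

print(isSafeForRequestPath("mona/LinkedList"))                        // true
print(isSafeForRequestPath("mona/LinkedList%0D%0AX-Injected:%20yes")) // false
```

The second input decodes to a string containing a CR LF pair followed by what looks like a header line, which is exactly the kind of thing that should never reach the request.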

The question is whether they are truly the most robust way to represent a Swift package. I would argue that they are not. They are far too complex for our needs, and if we didn't use them, our infrastructure would be more robust with fewer points of possible failure.

Additionally, one of those things is not accurate:

  • Copy-pasting the URL into your browser may take you to an entirely different site. See my examples in the previous post - Foundation's URL does not even match Safari on Apple's own platform. Also, I did investigate how that uXSS vulnerability was fixed, and it turns out the issue was indeed a mismatch between WebKit's URL type and Foundation (or CF, to be precise).

And the others do not require URLs:

  • TLS is a transport-level technology and does not require URLs.

  • We can still use DNS without URLs.

Maybe it will, maybe it won't. It is certainly plausible that it could be exploited (I suppose it depends what you mean by 'exploited'; perhaps it's more accurate to say that while spoofing usually means tricking a human being, this is more akin to spoofing your code and infrastructure). Even if it just manipulates requests to inject headers, isn't that bad enough? I'm not sure if there are any explicit safeguards against these kinds of things, except for whatever protections might accidentally fall out of the implementation.

A conclusive answer would take more time than I currently have, and clearly defined parameters for what we consider to be unacceptable manipulation of the process by a specially-crafted package.

Okay, well I think it might be worth considering, as part of this new registry system, whether that is still a good idea. For example, does it align with users' expectations? Are there other ways of assigning an identity to a package which are also compatible with not having a central naming authority and don't require taking on all of the baggage of URLs? And how do those solutions align with users' expectations?

For example, we already have issues with things like case-sensitivity. IDNA transformations are another source of difficulty. Basically, deciding whether 2 URLs are the same is non-trivial in Foundation's model (which doesn't do IDNA at construction), and we have no way of telling whether URLs which differ could point to the same package. We might be able to do something about that.

P.S: I just want to add that I really appreciate you and the other proposal authors taking this on. It’s an important development for SPM which is why I’m trying to really interrogate every detail.

3 Likes

+1. A huge thank you to everyone involved in this proposal.

Apologies if I missed this, but I didn’t see an affordance for “configure this machine to only fetch packages & source code through my company registry & git mirror”. Is that a supported use case in this proposal?

This would be similar to having a ~/.config/pip.conf file with ‘index-url: my company.pypi.mirror’

This would be beneficial for audited / regulated teams.

1 Like