SE-0292 (2nd review): Package Registry Service

i don’t really think it’s the swift community’s job to be antitrust enforcers. the swift project itself, including the package manager, is already hosted on github, and there seems to be no serious intention or reason to migrate off of it.

on the other hand, if we are concerned about “bias”, then the following passage from the proposal is pretty alarming to me:

Package names are case-insensitive, diacritic-insensitive (for example, Å ≍ A), and width-insensitive (for example, Ａ ≍ A). Package names are compared using Normalization Form Compatible Composition (NFKC).

stripping diacritics for the sake of preventing typosquatting is just wrong, from both a technical and an ethical perspective. the argument that “security” justifies imposing such a backwards-looking restriction is not a new one, nor has it ever been exclusive to software development.
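
To make the quoted rule concrete, here is a rough sketch using Foundation. It assumes the proposal's comparison behaves like NFKC followed by Foundation's case, diacritic, and width folding, which the quoted passage does not actually specify; `nfkc` and `comparisonKey` are hypothetical helper names, not SwiftPM API.

```swift
import Foundation

// NFKC on its own. Foundation exposes it as precomposedStringWithCompatibilityMapping.
func nfkc(_ s: String) -> String {
    s.precomposedStringWithCompatibilityMapping
}

// Hypothetical comparison key: NFKC, then case, diacritic, and width folding.
func comparisonKey(_ s: String) -> String {
    nfkc(s).folding(
        options: [.caseInsensitive, .diacriticInsensitive, .widthInsensitive],
        locale: nil)
}

let fullwidthA = "\u{FF21}"   // FULLWIDTH LATIN CAPITAL LETTER A
let aWithRing = "\u{00C5}"    // LATIN CAPITAL LETTER A WITH RING ABOVE

// NFKC already erases the width distinction...
print(nfkc(fullwidthA) == "A")
// ...but it keeps the ring: Å and A stay distinct under NFKC alone.
print(nfkc(aWithRing) == "A")
// Only the extra diacritic folding makes Å compare equal to a.
print(comparisonKey(aWithRing) == comparisonKey("a"))
```

The point of the sketch is that diacritic stripping is a separate step layered on top of NFKC, so it can be dropped without affecting width or case handling.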

on a slightly different note, i think this sentence is a typo

The maximum length of a package name is 39 characters. A valid package scope matches the following regular expression pattern:

since a few sentences later, it says the character limit is 128.

just adding to this, swift also allows cyrillic identifiers, including the cyrillic А that for some reason has been used as an example in the Security section:

Package scopes are restricted to a limited set of characters, preventing homograph attacks. For example, "А" (U+0410 CYRILLIC CAPITAL LETTER A) is an invalid scope character and cannot be confused for "A" (U+0041 LATIN CAPITAL LETTER A).

inclusivity does not just mean western europe!
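
A small sketch of the homograph point, using only the standard library. The scope check below is an approximation based on what this thread says (ASCII only, at most 39 characters); the hyphen rule and the name `isASCIIOnlyScope` are assumptions, not the proposal's actual regular expression.

```swift
let latinA: Unicode.Scalar = "\u{0041}"      // LATIN CAPITAL LETTER A
let cyrillicA: Unicode.Scalar = "\u{0410}"   // CYRILLIC CAPITAL LETTER A

// The two scalars look alike but are different code points, and both may start an identifier.
print(latinA == cyrillicA)
print(latinA.properties.isXIDStart, cyrillicA.properties.isXIDStart)

func isASCIIAlphanumeric(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar {
    case "a"..."z", "A"..."Z", "0"..."9": return true
    default: return false
    }
}

// Hypothetical scope check: ASCII letters, digits, and interior hyphens, 1 to 39 scalars.
func isASCIIOnlyScope(_ scope: String) -> Bool {
    let scalars = Array(scope.unicodeScalars)
    guard (1...39).contains(scalars.count),
          isASCIIAlphanumeric(scalars[0]),
          isASCIIAlphanumeric(scalars[scalars.count - 1])
    else { return false }
    return scalars.allSatisfy { isASCIIAlphanumeric($0) || $0 == "-" }
}

print(isASCIIOnlyScope("mona"))             // Latin only
print(isASCIIOnlyScope("\u{0410}pple"))     // starts with the Cyrillic А
```

In other words, the homograph is prevented only because scopes exclude every non-ASCII script, not because Cyrillic letters are invalid identifier characters in general.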

If we feel this is a serious issue, then restricting identifiers to printable ASCII would be the right call. Arbitrarily deciding to treat "exotic"(?!) diacritics as non-important, while carefully distinguishing between slight emoji variations is a questionable call.

In practice, people will simply copy and paste package identifiers whose names they can't comfortably type on their keyboard.

Signed,
Lőrentey Károly

Edit: I just realized that most emoji aren't in ID_Start or ID_Continue, so they won't be allowed in package identifiers -- that sounds like an excellent idea to me!

My point on singling out diacritics still stands though -- diacritics are not at all optional decorations, and diacritic characters are neither more nor less objectively easy to type (or visually distinguish) than Hanzi/Devanagari/Cyrillic scripts (or indeed literally any script used by any human ever).

Diacritics aren't even particularly confusable -- which misspelling is easier to spot: "ÁS̈CÍÍ" or "ASCll"? (Hint: the latter ends with two ells.)

Thanks for attempting to clarify. I don't believe your clarification differs substantially from my characterization, and it certainly doesn't differ from my understanding.

But my concern remains. Let's say I have a project that includes five packages: a.a, b.b, c.c, e.e, and f.f. And let's say that a.a, b.b, and c.c are all hosted at GitHub. Great, so I'll rely on the default I've configured for that. But to use the final two packages, e.e and f.f, I would have to configure two additional registries, one specific to each of those. Again, the requirement for this additional configuration creates a bias against anything not hosted at GitHub.

My preference would be to create an additional level of indirection at the level of the package-id name lookup. A simple package-id name registry service that would map a package-id into the url for its registry. Presumably there would be a single well-known such entity, though there could be additional name registries. A search path through such name registries could be specified by the user at global or project level to specify overrides. And perhaps with scope-based and project-id based overrides as a final granular level of override.
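
One way to picture this extra level of indirection is the sketch below. Everything in it is hypothetical: the type names, the `.example` URLs, and the in-memory dictionaries standing in for what would be a network lookup service.

```swift
import Foundation

// Hypothetical name registry: maps a package ID to the URL of the registry hosting it.
struct NameRegistry {
    var entries: [String: URL]

    func registryURL(for packageID: String) -> URL? {
        entries[packageID]
    }
}

// Hypothetical resolver: user overrides first, then each name registry on the search path.
struct NameResolver {
    var overrides: [String: URL] = [:]
    var searchPath: [NameRegistry]

    func registryURL(for packageID: String) -> URL? {
        if let url = overrides[packageID] { return url }
        for nameRegistry in searchPath {
            if let url = nameRegistry.registryURL(for: packageID) { return url }
        }
        return nil
    }
}

let wellKnown = NameRegistry(entries: [
    "a.a": URL(string: "https://hub-one.example")!,
    "e.e": URL(string: "https://hub-two.example")!,
])
var resolver = NameResolver(searchPath: [wellKnown])
print(resolver.registryURL(for: "e.e") as Any)

// A user-level override takes precedence over the shared name registry.
resolver.overrides["a.a"] = URL(string: "https://mirror.example")!
print(resolver.registryURL(for: "a.a") as Any)
```

With a lookup like this, a package that moves hosts only needs its entry in the name registry updated; consumers keep the same package ID and configuration.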

Let me make another observation as well. Given the current revision of the proposal, if the author of the project a.a as described above chooses at some point to relocate their package from GitHub to SomeOtherHub, then any current users of a.a would likely have to reconfigure their custom registries. If there were another level of indirection in the mapping, then the package author could simply change the mapping of their package-id to the new registry URL in the name registry.

I like GitHub. It's a good product and works well. But I think it's important that we as a community understand that in accepting this design we are de-facto making GitHub our registry, and making it hard for any other registries to ever compete. I would much prefer a design that had another level of indirection driven by a simple network-based name lookup service that maps package-ids into registry urls. Such a design would flatten the competitive landscape and provide more flexibility.

Yes, I know there are questions of who would run such a name registry. Perhaps GitHub would do so, behind an appropriately swifty vanity-url so that it could be moved at some point if GitHub were to implode. Having this integrated into GitHub could also solve the complexity problem, at least for users of GitHub: maybe there would be a per-project defaulted setting that would automatically populate the name-registry for the project.

The definition of “diacritic” is somewhat fuzzy. I searched for the technical definition of what ICU does, but was unable to find it. The European diacritics weren’t really my concern (as much as they are important), but rather all the abugidas of South and Southeast Asia, such as the script used for Hindi. If “diacritic” simply means a combining class other than 0, then the abugidas become completely unintelligible, because every vowel is a combining character, as are gemination marks and the like. As an English comparison, imagine if “bait” ≍ “bat” ≍ “batty” ≍ “beat” ≍ “beet” ≍ “bet” ≍ “Betty” ≍ “bight” ≍ “bit” ≍ “bite” ≍ “boat” ≍ “boot” ≍ “bot” ≍ “bout” ≍ “but” ≍ “butt” ≍ “byte”...

I don’t know of a concrete example of a Hindi project, but I do know of entire code bases I cannot read because they are entirely in Korean. I cannot type a single one of their identifiers. I don’t use them and I never care to. But I do know that even if Swift permitted only ASCII everywhere, those code bases would still be in Korean, just a very awkward version of it clumsily transliterated into Latin letters. It would still be hopeless for me to try to understand what their code does or how to use their library. I would need a translation shim either way. (I don’t use the Portuguese code bases I’ve come across for the same reason, even though I do recognize the individual letters.) So I say let everyone else use their language the natural way and don’t worry about it. It’s easier for them and there are no negative effects for me.


P.S. Yes, we do agree here:

I would say the design already absolutely allows this: someone could set up a registry that essentially acts as a proxy by implementing the registry API and calling out to other registries to serve the actual data.

Since the configuration is per user, not per package, it also seems tractable to me that this could happen at any time after GH hypothetically became a common choice for the default registry.

If the proposal means code points it should probably say code points. In the context of Swift the word "character" does not usually mean code point.

Yes, somebody could set up a registry that proxies other registries. But recognize that such a registry proxy would bear the bidirectional bandwidth cost of the fully loaded traffic, likely making it quite expensive to run and not a good default: certainly out of the realm of a community solution unless it had a deep-pocketed sponsor. The alternative I suggest, a simple name lookup service, would be a lot more resource efficient.

Edit: (Perhaps my statement above is false if the proxy registry is able to simply redirect for each of the proxied calls, in which case the bandwidth is more manageable).

First, let me say this up front: I agree with you and the other folks on this thread who think diacritic insensitive comparisons aren't a good idea. I think the proposal should be amended to remove that before it's adopted. Thanks to you, @hisekaldma, @wtedst, @lorentey, and @taylorswift for the helpful feedback y'all provided.

One of the things I like most about Swift is its support for Unicode throughout the language. So when it came time to design an identity scheme for packages, I wanted to preserve that as much as I could. For better or for worse, ASCII-only is the norm for package ecosystems, so there's not much in the way of prior art. To that end, I'm very thankful for the resources like UAX #31, which do a great job of explaining the challenges and strategies for working in this complex domain.


Responding to your post, @SDGGiesbrecht:

I found this definition from the ICU glossary (which appears to have recently gotten revamped :nail_care:):

Diacritic
A modifying mark on a character. For example, the accent marks in Latin script (acute, tilde, and ogonek) and the tone marks in Thai. Synonymous with accent.

For Devanagari / Indic scripts and maybe other abugidas, my understanding is that characters are combined using ZWJs. [1]

Thanks as always for the opportunity to geek out about Unicode :clap:

You're absolutely correct. That should be "scope" not "name". Thanks for pointing this out!

Depending on the context, Unicode documentation may refer to a "character" as either an individual code point or an extended grapheme cluster. The former "code point / scalar" meaning is, by my reading, the one typically used in programming contexts. To wit, our proposal mentions the XID_Start and XID_Continue character properties, which only make sense if we're talking about members in the Unicode Character Database.

My hope was that the regular expression would clear up any ambiguity. But if this remains a source of confusion, I can revisit this language in the proposal before it's adopted.

The amended design is certainly more in line with the core team's requested exploration, and I think it is the right step. Some initial questions/comments:


I'm glad the issue with diacritics has been thoroughly discussed. It's nice that being 8 hours late to the party means I don't have to write more on that.


I have to say that I agree with @SDGGiesbrecht: I don't see the point of case-insensitive comparisons. Moreover, the security and ergonomic considerations you name argue against rather than for case folding, which could create rather than remove the very security problem you're trying to avoid:

Consider that each scope and package name will be subject to the policies of the registries that they are hosted on, and recall that there are locale-dependent case changes (cf. Turkish dotted i). If SwiftPM additionally applies its own locale-independent case folding, then a user could specify a {scope} and package name exactly correctly, and SwiftPM could apply case changes that differ from the registry's, thereby requesting a malicious squatter's code from the registry instead.


How so? Swift has never (to my knowledge) enforced any sort of maximum on package name length, nor of any variable or type. The length of any name is going to be bounded by registries to some maximum anyway, and more saliently, it will be bounded by what users will be willing to type into their package manifests.

Put another way, the question would be: Why is it important to you that every scope and every package name, regardless of where they are hosted, be subject to the same length limitation? Is there some attack vector made possible by exceedingly long package names?


Finally, Swift does not offer an NFKC normalization API; would we need to do so for this proposal to be fully implemented?

This is looking great! I do have some concerns about communicating best practices for migrating from git URLs to package IDs, though. It seems like changing a package dependency description to use a registry ID should always be a major version bump, because otherwise it will break any packages that depend on it and don't have the same set of registries configured. I could see this becoming a pretty big issue if, say, the switch is made in a patch version of a really popular transitive dependency. Because there's no way to determine which registry the ID was intended to be fetched from, errors like this also won't be very actionable without consulting READMEs, documentation, etc. Would it be worth adding an optional parameter to the .package(...) declaration for a 'recommended' registry that wouldn't impact resolution at all, but could be suggested to the user if the process fails (e.g. error: cannot resolve package 'mona.LinkedList'; run swift package-registry set --scope ... to use the registry recommended by the author of XYZ)? I'm not sure if that's a great idea, but it might help a little.

Minor corrections to some technical Unicode details above. Intended for @mattt and anyone interested. Likely boring and unimportant to most readers.

ZWJ is only when you want to coax out an incomplete glyph in isolation without the rest of the construct. (The ZWJ interacts as though it were an invisible vowel component.) Normal usage just strings the construct components together.

I did double‐check the UCD and I was wrong earlier about them having a combining class other than 0. They use an entirely different system to determine what is a base and what is combining. Again, the presence of competing systems and definitions makes me wary of diacritic insensitivity. We cannot know if the algorithm is stable if we cannot even find documentation for it. Imagine if two packages work side‐by‐side until they don’t anymore, just because ICU adjusts the algorithm. (The other kinds of folding discussed here—NFKC and case folding—both have stability guarantees from Unicode.)
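
The properties in question can be inspected from Swift's standard library. The sketch below is not what ICU does (that algorithm is the undocumented part); `strippingMarks` is a hypothetical, deliberately naive fold that drops every scalar whose general category is a mark, to show how much an over-broad definition of "diacritic" would erase.

```swift
let acute: Unicode.Scalar = "\u{0301}"        // COMBINING ACUTE ACCENT
let vowelSignI: Unicode.Scalar = "\u{093F}"   // DEVANAGARI VOWEL SIGN I

// The Latin accent has a nonzero combining class; the Devanagari vowel sign has class 0
// and is instead identified as a mark by its general category.
print(acute.properties.canonicalCombiningClass.rawValue)
print(vowelSignI.properties.canonicalCombiningClass.rawValue)
print(vowelSignI.properties.generalCategory == .spacingMark)

// Hypothetical naive fold: remove every nonspacing, spacing, or enclosing mark.
func strippingMarks(_ s: String) -> String {
    var result = String.UnicodeScalarView()
    for scalar in s.unicodeScalars {
        switch scalar.properties.generalCategory {
        case .nonspacingMark, .spacingMark, .enclosingMark:
            continue
        default:
            result.append(scalar)
        }
    }
    return String(result)
}

// HA, VOWEL SIGN I, ANUSVARA, DA, VOWEL SIGN II (the word "Hindi" in Devanagari).
let hindi = "\u{0939}\u{093F}\u{0902}\u{0926}\u{0940}"
// Only the two consonants survive.
print(strippingMarks(hindi) == "\u{0939}\u{0926}")
```

Which of these two definitions (combining class or general category) a given "diacritic-insensitive" comparison uses changes the result completely, which is the stability concern raised above.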

I have the same worries as @owenv. Unlike the original proposal, this one completely separates git-based and registry-based package retrieval, which makes it painful to adopt because it breaks backward compatibility. I think there should be some way to prefer registries while keeping git compatibility, which would guide SwiftPM users to migrate and align their practices of declaring package dependencies.

I'm totally in favour of this proposal; it seems like the right move to me.

  • regarding the NFKC normalisation: +1, assuming we actually allow non-ASCII
  • diacritic insensitivity / non-ASCII: I don't feel strongly about it but I'd personally go for a subset of ASCII (e.g. [a-z0-9-]+ or so). Allowing non-ASCII seems like uncharted territory and a bit of a minefield (both on the technical level and the cultural level). Having something as complex as this is bound to create implementation differences across the registries, so I reckon only ASCII will be fully safe on all registries anyway. Complexity also usually fosters security vulnerabilities.
  • regarding the bias towards GitHub: I think this new version makes a hugely important step to reduce the bias towards GitHub. With this proposal you can freely move your package, say from GitHub to GitLab, and everything keeps working with the same fidelity. That didn't use to be the case in the first version because the registry was found at the same domain as the repo URL (which was very hard to change). I will however say that I would've appreciated it if the proposal mentioned that the authors are, IIUC, all associated with GitHub.
  • We should add a requirement for any registry that a user must be able to freely adjust the Link: <URL>; rel="canonical" at any point. So for example I think the GitHub registry must allow changing the rel="canonical" of a project to, say, GitLab or any other host.

This proposal looks good to me, it seems to cover a wide range of needs. I do have a few notes.

The JSON format means that if anyone wants to use a scope named "default", it will not be possible to express a specific registry URL for it. Presumably we should simply forbid the scope name "default" and require all registries to ignore it? Alternatively, the .swiftpm/config/registries.json file could mark the default scope with a sigil that is not a valid scope name (perhaps *?).
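
The collision is easy to see with a small decoder. The JSON shape below (a "registries" object keyed by scope, with "default" as the fallback key) is an assumption about the configuration format based on this discussion; the type and function names are hypothetical.

```swift
import Foundation

// Assumed shape of .swiftpm/config/registries.json; not taken from the specification.
struct RegistryConfiguration: Decodable {
    struct Entry: Decodable {
        let url: URL
    }
    let registries: [String: Entry]
}

let json = """
{
  "registries": {
    "default": { "url": "https://registry.example" },
    "mona": { "url": "https://mona.example" }
  },
  "version": 1
}
"""

let configuration = try! JSONDecoder().decode(
    RegistryConfiguration.self, from: Data(json.utf8))

// Scoped entry if present, otherwise the entry stored under the "default" key.
func registryURL(forScope scope: String) -> URL? {
    (configuration.registries[scope] ?? configuration.registries["default"])?.url
}

print(registryURL(forScope: "mona") as Any)      // scoped registry
print(registryURL(forScope: "other") as Any)     // falls back to the default
// A scope literally named "default" shares its key with the fallback entry,
// so it cannot be given a registry of its own.
print(registryURL(forScope: "default") as Any)
```

Choosing a fallback key that can never be a valid scope (such as one containing "*" or brackets) removes the ambiguity without reserving a scope name.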

Nit: The URLs in the section "Set-mirror option for package identifiers" are ill-formed: they begin https:/// (an extra slash).

There has been a discussion up-thread about using .netrc files for credentials. I think this is a good idea and would go further and say that URLs with userinfo components should not be added to the registry file at all (error on configuration). This substantively mitigates the need to police against secrets leaking in these files.

Regarding package name length, it's worth noting that even if this proposal does not impose one, there will be a practical upper limit on name length, gated implicitly by the maximum size of the HTTP request target allowed by the union of proxies and the origin server running the registry. That size is substantially larger than 128 characters (limits tend to cluster around 2000 for the full request-target, implying a length of around 1000 is almost certainly practical).

From the proposal

A formal specification for the package registry interface is provided alongside this proposal.

This was a problem with the last proposal, as well - there are no links to this specification that I could find from the document. It's really important that we consider both of these as being up for review together, IMO. I can't really review this until I see them both.


Just based on what is here, though: I like the idea of having package identities which are not tied to URLs. It's shockingly convenient to be able to say "this package depends on apple.swiftnio" and have things somehow... just work. You could imagine how great this would be if/when scripts are able to declare package dependencies.

One thing I am concerned about is privacy. How does SwiftPM know which registries contain which packages? Does it send the package IDs in plaintext? Because broadcasting your dependencies to every registry you have configured, fishing for one that claims to have it, is really not very private at all.

It would be great if there was a way to send an obfuscated version of the package ID, such that only the registry that holds the package would be able to know which package you're actually referring to. I'm not a crypto expert, so I'm not sure if that's possible. My concern here is less about GitHub, and more about oppressive governments who might spy on your internet traffic and be very interested in which of their residents are making use of things such as encryption technology hosted abroad.

Also, as I mentioned in the previous review, we should work to reject the user info component entirely as soon as possible. GitHub's strategy of embedding OAuth tokens in there is seriously flawed and they need to stop that ASAP - they actually recommend that you put the token in the username component! :man_facepalming:

Unicode case folding is context-insensitive and language-independent unless you specifically opt-in to Turkic mappings. [1] [2]

Would you have any concerns with the following transformations if we called it out more explicitly?

{Name} -> NFKC -> XID_ filter -> locale-independent case folding
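
As a sketch, that pipeline could look like the following with Foundation. The reading of "XID_ filter" as "first scalar is XID_Start, the rest are XID_Continue" and the 128-scalar limit are taken from the discussion in this thread; `normalizedPackageName` is a hypothetical name, not SwiftPM API.

```swift
import Foundation

// Returns the comparison form of a package name, or nil if the name is rejected.
func normalizedPackageName(_ name: String) -> String? {
    // 1. NFKC normalization.
    let composed = name.precomposedStringWithCompatibilityMapping

    // 2. XID filter: XID_Start followed by XID_Continue, at most 128 scalars.
    let scalars = Array(composed.unicodeScalars)
    guard let first = scalars.first,
          scalars.count <= 128,
          first.properties.isXIDStart,
          scalars.dropFirst().allSatisfy({ $0.properties.isXIDContinue })
    else { return nil }

    // 3. Locale-independent case folding.
    return composed.folding(options: .caseInsensitive, locale: nil)
}

print(normalizedPackageName("LinkedList") as Any)
print(normalizedPackageName("\u{FF2C}inkedList") as Any)   // starts with a fullwidth L
print(normalizedPackageName("1List") as Any)               // a digit is not XID_Start
```

Note that no diacritic folding appears anywhere in this pipeline, which matches the direction the thread has been moving in.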


Swift Package Manager should be able to resolve a mixed package graph with identifiers and URLs, because the API responses from the registry provide mappings between the two. If a transitive dependency switched from URLs to IDs, everything would continue to work as expected.

Developers would continue to be able to resolve external dependencies using Git. The mappings provided in the API responses to GET /{scope}/{package} and GET /identifiers{?url} provide a migration path for adopting package identifiers.

Thanks for pointing that out. Our original format for package identifiers was @mona/LinkedList, but the Swift core team made a request to change that shortly before it came up again for review. As such, I haven't had much opportunity to think through the full implications of that change.

You're correct to point out the need for "default" to be spelled in such a way that it's not a valid scope name. I think [default] would be a nice alternative, but I'm happy to bikeshed that.

According to the current HEAD of main on apple/swift-package-manager (bbcfe08):

// --netrc-file option only supported on macOS >=10.13

Until .netrc is supported on all platforms, I don't think we can remove support for URL-encoded credentials.

Maximum URL sizes in HTTP payloads were my primary motivation for enforcing a size. Given the inconsistent practical limits in client and server HTTP libraries, I think it makes sense to enforce such a limit in the spec.

Aside from that, putting reasonable upper bounds on the size simplifies other implementation details. It's nice for setting column constraints in the database (and avoiding TOAST). It also protects against ReDoS attacks.

128 is indeed more conservative than what could safely fit in an HTTP payload, but comfortably large enough to accommodate the existing ecosystem. For instance, the longest name in SwiftPackageIndex/PackageList is 61 characters. If anyone can make the case for allowing names longer than the proposed 128, I'd be very interested to hear it.

Apologies for the inconvenience here — I couldn't think of a good way to durably link between these documents while they were still in-flight. The server specification and OpenAPI document will be located in the @apple/swift-package-manager repository. (Direct link to the spec).

Answering your questions individually:

  • Swift Package Manager doesn't know about the existence of registries beyond what's configured. And it doesn't know what those registries contain, except by virtue of requesting each one and either getting a 200 or a 404.
  • Our proposal requires client and server to communicate over a secure HTTPS connection. The package ID is encoded in the URL path, which is encrypted.
  • Each package is requested individually, in isolation from one another. If a default registry and a scoped registry were configured, the scoped registry would know only what packages you requested for that scope, and the default registry would know only about packages without that scope.

This is guaranteed by the transport-level security provided by HTTPS. If you wanted stronger guarantees about privacy, you could configure an intermediate proxy server, mirror packages on a registry using an obfuscated identifier, or vendor all dependencies to avoid fetching external dependencies entirely.
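
The behavior described in the bullets above can be sketched as follows. The registry URLs and helper names are hypothetical; the only detail taken from the thread is that a package is requested at /{scope}/{package} on whichever registry is configured for its scope, falling back to the default.

```swift
import Foundation

// Hypothetical client configuration: one default registry plus per-scope registries.
let defaultRegistry = URL(string: "https://default.example")!
let scopedRegistries = ["mona": URL(string: "https://mona.example")!]

// Builds the request URL for a "scope.name" package ID.
func requestURL(for packageID: String) -> URL {
    let scope = String(packageID.prefix(while: { $0 != "." }))
    let name = String(packageID.dropFirst(scope.count + 1))
    let registry = scopedRegistries[scope] ?? defaultRegistry
    return registry
        .appendingPathComponent(scope)
        .appendingPathComponent(name)
}

let dependencies = ["mona.LinkedList", "apple.swiftnio", "mona.Trie"]

// Group the requested IDs by the host that receives them.
let seenByHost = Dictionary(grouping: dependencies) { requestURL(for: $0).host ?? "" }
for (host, ids) in seenByHost {
    print(host, ids)
}
```

Each registry only ever receives requests for the scopes routed to it, and the package ID travels in the URL path rather than in anything sent before the TLS session is established.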

Perhaps I didn’t express the issue sufficiently clearly. It does not matter what we do here; it can be absolutely “perfect,” but unless every registry uses the exact same transformations (and they will not, because sites like GitHub are free to determine their own requirements for user names and repository names), then any transformation applied by SwiftPM, however sensible, will create opportunities for malicious actors to squat the transformed name on a registry where there is a discrepancy between the transformed name and the original name from the perspective of the registry’s transforms.

For example, a Turkish repository may (and very sensibly) opt into Turkic mappings. Then SwiftPM, by applying a locale independent case folding, will create a security problem where none need exist.
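
A minimal sketch of the scenario, under loud assumptions: the registry is imagined to key packages by Turkic lowercasing, and the client is imagined to send a locale-independently folded name. Neither behavior is specified anywhere; `registryKey`, `clientKey`, and the package names are hypothetical.

```swift
import Foundation

let turkish = Locale(identifier: "tr_TR")

// Hypothetical registry policy: names are keyed by their Turkic lowercase form.
func registryKey(_ name: String) -> String {
    name.lowercased(with: turkish)
}

// Hypothetical client policy: locale-independent case folding before the request.
func clientKey(_ name: String) -> String {
    name.folding(options: .caseInsensitive, locale: nil)
}

var registry: [String: String] = [:]
registry[registryKey("IRIS")] = "original author"   // keyed with dotless ı
registry[registryKey("iris")] = "squatter"          // keyed with dotted i

// The user spells the package exactly as published, but the client folds it first.
let requested = clientKey("IRIS")
print(registry[registryKey(requested)] ?? "not found")
```

Under these assumptions the registry treats "IRIS" and "iris" as two different packages, and the client's own folding turns a correct request for the first into a request for the second.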

Sorry for misunderstanding your concern. Two follow-up points to try to get on the same page:

I tried to translate that thought experiment into code, but I'm having trouble understanding what you mean:

import Foundation

// Package name declared as dependency
let name = "I" // LATIN CAPITAL LETTER I (U+0049)

// Registry with locale-independent case folding
let correct = name.folding(options: .caseInsensitive, locale: nil) // LATIN SMALL LETTER I (U+0069)

// Registry with locale-dependent case folding
let incorrect = name.folding(options: .caseInsensitive, locale: Locale(identifier: "tr_TR")) // LATIN SMALL LETTER DOTLESS I (U+0131)

// Registry returning arbitrary, incorrect string
let arbitrary = "foo"

// Swift Package Manager validating name of package returned by registry 
name.compare(correct, options: .caseInsensitive, locale: nil) == .orderedSame // true
name.compare(incorrect, options: .caseInsensitive, locale: nil) == .orderedSame // false
name.compare(arbitrary, options: .caseInsensitive, locale: nil) == .orderedSame // false

In our proposal, package names aren't necessarily derived from repository names, so I'm not sure how GitHub's (or any other hosts') naming policies for repositories would impact how package identities are resolved.

Small correction: we do have NSString.precomposedStringWithCompatibilityMapping in Foundation.