SE-0292 (2nd review): Package Registry Service

tomerd · March 25, 2021, 7:38pm

The second review of SE-0292: Package Registry Service begins now and runs through April 5, 2021.

Based on the feedback from the first review, the core team feels confident that the ideas behind Package Registry Service are useful and put the Swift packages ecosystem on the right path. One key area the core team asked to explore further in response to the first review was the nature of the package identifiers, given their potential utility in addressing Swift module name conflicts. The topic has been deeply explored in a discussion thread, leading to the conclusion that opaque identifiers are the preferred solution for the Swift package ecosystem.

The proposal has been amended to include this decision, as well as address other feedback from the first review, and is now ready for a second review.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, sent directly to the review manager, for example by direct message in the Swift forums.

What goes into a review of a proposal?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift.

When reviewing a proposal, here are some questions to consider:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Thank you for helping improve the Swift programming language and ecosystem.

Tom Doron
Review Manager

jberry · March 25, 2021, 8:59pm

While this revision makes progress with the concept of a scoped package identifier, I'm concerned that it does not go far enough in the area of package id lookup. In particular, and perhaps unintentionally, it seems to perpetuate a large anti-competitive bias in favor of GitHub (which happens to be the employer of all three proposal authors).

Let's start by stipulating that GitHub is home to the large majority of open source projects available in the Swift ecosystem. A large percentage of projects are already, and will continue to be, hosted there.

According to my understanding of this revision, a user may configure, either globally or per-project, a list of registries in which to resolve package-ids, falling back to a default. Mappings from "scope" to registry outside of that default must be explicit. Given that so many projects live at GitHub, it will be natural to configure GitHub as the default registry.

So my concern is this: that the design presented in this revision perpetuates a bias toward GitHub. It will be the de-facto default registry because a high percentage of packages used by any particular user or project will likely be hosted at GitHub. To specify any non-GitHub packages will require additional configuration, which will be tedious, and likely per-target-package; this biases choice of hosting toward GitHub, and against any existing or future alternatives.

SDGGiesbrecht · March 25, 2021, 9:15pm

Package scopes are case-insensitive (for example, mona ≍ MONA ). Package names are case-insensitive, diacritic-insensitive (for example, Å ≍ A ), and width-insensitive (for example, Ａ ≍ A ). Package names are compared using Normalization Form Compatible Composition (NFKC).

Diacritic‐insensitivity is a step too far. It only makes sense with a handful of scripts and makes a colossal mess of others.

NFKC is probably enough on its own. If case‐insensitivity is desired for some reason, it shouldn’t cause any problems, but I don’t see a real need for it.
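As an illustration of how these foldings differ, here is a minimal sketch, assuming Foundation is available. The helpers `nfkc` and `scalars` are hypothetical names introduced only for this example; Foundation's `precomposedStringWithCompatibilityMapping` stands in for NFKC, and nothing here reflects how SwiftPM or a registry actually compares names.

```swift
import Foundation

// Hypothetical helpers for illustration; not part of SE-0292 or SwiftPM.

/// NFKC normalization via Foundation.
func nfkc(_ s: String) -> String {
    s.precomposedStringWithCompatibilityMapping
}

/// Unicode scalar values, so that comparisons are not influenced by the
/// canonical-equivalence rules Swift applies in `String ==`.
func scalars(_ s: String) -> [UInt32] {
    s.unicodeScalars.map { $0.value }
}

// Width: FULLWIDTH LATIN CAPITAL LETTER A (U+FF21) maps to "A" under NFKC.
let widthFolded = scalars(nfkc("\u{FF21}")) == scalars("A")

// Normalization: decomposed "E" + U+0301 and precomposed U+00C9 agree after NFKC.
let normalized = scalars(nfkc("E\u{0301}clair")) == scalars(nfkc("\u{00C9}clair"))

// Diacritic folding is a separate, lossy step: it erases the ring on U+00E5.
let diacriticFolded = "m\u{00E5}nad".folding(options: .diacriticInsensitive, locale: nil)
let collides = diacriticFolded == "manad"

print(widthFolded, normalized, collides)
```

The first two checks only unify different encodings of the same text, while the third makes two distinct words compare equal.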

The maximum length of a package name is 128 characters.

Is there a reason for this seemingly arbitrary restriction? Also, if this matters on the engineering side, then we are talking about 128 of what? I doubt I can stack two million acute accents on a letter E, still only have one character in Swift’s terminology, and all the while satisfy the restriction.

Srdan_Rasic · March 25, 2021, 9:22pm

I really like the proposal, although I'm not an expert in the topic. It seems to be providing an important alternative to hosting packages in Git repos.

I would assume that registries would be heavily used in private organisations. In that context, I would like to see more details on authentication/authorisation, namely regarding credentials. How would one pass auth credentials to the registry service? The proposal briefly mentions using the authority component of the URL, but also correctly identifies that as a security risk.

I think that it would be nice if the proposal addressed auth credentials in more detail. One suggestion would be to consider supporting .netrc files, which SPM might already support in some limited form, if memory serves me well.

hisekaldma · March 25, 2021, 10:07pm

I agree completely with @SDGGiesbrecht on this – diacritic insensitivity makes no sense. To a Swedish speaker, the specific example given is also just plain wrong: Å is definitely not interchangeable with A – “månad” means ‘month’, “manad” means ‘compelled’. They are the same letter as much as O and Q are the same letter.

mattt · March 25, 2021, 10:13pm

Our goal in designing this package identity scheme was to balance security, ergonomics, and expressivity. I think we agree on normalization insensitivity (it'd be unexpected for "Éclair" ≭ "E◌́clair") and width insensitivity.

Case insensitivity was selected for both security and ergonomic considerations. It's unlikely that a single user or organization would publish multiple packages that differ only in capitalization. Based on anecdotal evidence from other package registry systems, such names would most likely serve as a vector for typosquatting attacks.

For diacritic insensitivity, the primary consideration is ergonomics; users in a different locale may not be able to type an identifier with (what they might consider to be) exotic diacritics. I don't know of many examples of minimally contrastive pairs in which an accent distinguishes two words — at least to the extent that it'd be a problem (e.g. not being able to name one package "Papa" and another one "Papà"). That said, I don't feel as strongly about this point as the others, so I take your point.

I think some limit on name length is reasonable. This particular choice is arbitrary, but we thought it was reasonable. The number 128 refers to characters as described in UAX #31, which I understand to be code points, not extended grapheme clusters.

To clarify, a single default registry may be defined either locally, at the project level (./.swiftpm/config/registries.json), or globally, at the user level (~/.swiftpm/config/registries.json). Users may optionally configure a registry — either locally or globally — for all packages within a single scope (i.e. mona in mona.LinkedList) to resolve with another registry.
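As a rough illustration of that lookup, the sketch below resolves a package identifier to a registry by checking for a scope-specific entry first and falling back to the default. The `RegistryConfiguration` type and the example.com URLs are hypothetical; the actual `registries.json` schema and SwiftPM's implementation may differ.

```swift
// Hypothetical model of the lookup described above; not SwiftPM's actual code.
struct RegistryConfiguration {
    /// Registry used when no scope-specific entry matches.
    var defaultRegistry: String?
    /// Scope-specific registries, keyed by lowercased scope.
    var scopedRegistries: [String: String] = [:]

    /// Returns the registry URL for an identifier such as "mona.LinkedList".
    func registry(for identifier: String) -> String? {
        // The scope is the part before the first ".".
        guard let scope = identifier.split(separator: ".").first else {
            return defaultRegistry
        }
        // Scopes are compared case-insensitively.
        return scopedRegistries[scope.lowercased()] ?? defaultRegistry
    }
}

let config = RegistryConfiguration(
    defaultRegistry: "https://registry.example.com",        // placeholder URL
    scopedRegistries: ["mona": "https://mona.example.com"]  // placeholder URL
)

let scoped = config.registry(for: "MONA.LinkedList")
let fallback = config.registry(for: "other.Package")
print(scoped ?? "none", fallback ?? "none")
```

In this model, any package whose scope has no explicit entry resolves against the single default registry.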

We describe use cases for custom registries in our proposal here:

This proposal adds a new swift package-registry subcommand for managing the registry used for all packages and/or packages in a particular scope.

Custom registries can serve a variety of purposes:

Private dependencies : Users may configure a custom registry for a particular scope to incorporate private packages with those fetched from a public registry.

Geographic colocation : Developers working under adverse networking conditions can host a mirror of official package sources on a nearby network.

Policy enforcement : A corporate network can enforce quality or licensing standards, so that only approved packages are available through a custom registry.

Auditing : A custom registry may analyze or meter access to packages for the purposes of ranking popularity or charging licensing fees.

Thanks for this suggestion. I honestly thought that .netrc files were discussed in the proposal, but perhaps I was confusing that with an earlier forum thread. I agree that they would be a good alternative to URLs with hardcoded credentials.

taylorswift · March 25, 2021, 10:25pm

i don’t really think it’s the swift community’s job to be antitrust enforcers. the swift project itself, including the package manager is already hosted on github, and there seems to be no serious intention or reason to migrate off of it.

on the other hand, if we are concerned about “bias”, then the following passage from the proposal is pretty alarming to me:

Package names are case-insensitive, diacritic-insensitive (for example, Å ≍ A ), and width-insensitive (for example, Ａ ≍ A ). Package names are compared using Normalization Form Compatible Composition (NFKC).

stripping diacritics for the sake of preventing typosquatting is just wrong, from both a technical, and an ethical perspective. the argument that “security” justifies imposing such a backwards-looking restriction is not a new one, nor has it ever been exclusive to software development.

on a slightly different note, i think this sentence is a typo

The maximum length of a package name is 39 characters. A valid package scope matches the following regular expression pattern:

since a few sentences later, it says the character limit is 128.

just adding to this, swift also allows cyrillic identifiers, including the cyrillic А that for some reason has been used as an example in the Security section:

Package scopes are restricted to a limited set of characters, preventing homograph attacks. For example, "А" (U+0410 CYRILLIC CAPITAL LETTER A) is an invalid scope character and cannot be confused for "A" (U+0041 LATIN CAPITAL LETTER A).

inclusivity does not just mean western europe!

lorentey · March 25, 2021, 10:36pm

If we feel this is a serious issue, then restricting identifiers to printable ASCII would be the right call. Arbitrarily deciding to treat "exotic"(?!) diacritics as non-important, while carefully distinguishing between slight emoji variations is a questionable call.

In practice, people will simply copy and paste package identifiers whose names they can't comfortably type on their keyboard.

Signed,
Lőrentey Károly

Edit: I just realized that most emoji aren't in ID_Start or ID_Continue, so they won't be allowed in package identifiers -- that sounds like an excellent idea to me!

My point on singling out diacritics still stands though -- diacritics are not at all optional decorations, and diacritic characters are neither more nor less objectively easy to type (or visually distinguish) than Hanzi/Devanagari/Cyrillic scripts (or indeed literally any script used by any human ever).

Diacritics aren't even particularly confusable -- which misspelling is easier to spot: "ÁS̈CÍÍ" or "ASCll"? (Hint: the latter ends with two ells.)

jberry · March 25, 2021, 10:40pm

Thanks for attempting to clarify. I don't believe your clarification differs substantially from my characterization, or from my understanding.

But my concern remains. Let's say I have a project that includes five packages, a.a, b.b, c.c, e.e, and f.f. And let's say that a.a, b.b, and c.c are all hosted at GitHub. Great, so I'll rely on the default I've configured for that. But to utilize the final two packages e.e and f.f I would have to configure two additional registries, one specific to each of those. Again, the requirement for this additional configuration creates a bias against anything not hosted at GitHub.

My preference would be to create an additional level of indirection at the level of the package-id name lookup. A simple package-id name registry service that would map a package-id into the url for its registry. Presumably there would be a single well-known such entity, though there could be additional name registries. A search path through such name registries could be specified by the user at global or project level to specify overrides. And perhaps with scope-based and project-id based overrides as a final granular level of override.
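A minimal sketch of that extra level of indirection, under the assumption that each name registry is a simple table from package-id to registry URL. The type `NameRegistry`, the function `resolve`, and all URLs are hypothetical and exist only to illustrate the search-path idea.

```swift
// Hypothetical sketch of a name-lookup indirection; not part of the proposal.
struct NameRegistry {
    let name: String
    /// Maps a package-id (e.g. "e.e") to the URL of the registry hosting it.
    let entries: [String: String]
}

/// Walks the search path in order and returns the first mapping found.
/// Explicit per-package overrides win over every name registry.
func resolve(_ packageID: String,
             overrides: [String: String],
             searchPath: [NameRegistry]) -> String? {
    if let url = overrides[packageID] { return url }
    for registry in searchPath {
        if let url = registry.entries[packageID] { return url }
    }
    return nil
}

let wellKnown = NameRegistry(name: "well-known", entries: [
    "a.a": "https://hub-one.example.com",  // placeholder URL
    "e.e": "https://hub-two.example.com",  // placeholder URL
])

// No per-package configuration is needed for "e.e"; the name registry supplies it.
let hostForE = resolve("e.e", overrides: [:], searchPath: [wellKnown])
// A user-level override still takes precedence.
let hostForA = resolve("a.a",
                       overrides: ["a.a": "https://mirror.example.com"],
                       searchPath: [wellKnown])
let unknown = resolve("z.z", overrides: [:], searchPath: [wellKnown])
print(hostForE ?? "none", hostForA ?? "none", unknown ?? "none")
```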

Let me make another observation as well. Given the current revision of the proposal, if the author of the project a.a as described above chooses at some point to relocate their package from GitHub to SomeOtherHub, then any current users of a.a would likely have to reconfigure their custom registries. If there was another level of indirection in the mapping, then the package author could simply change the mapping of their package-id to the new registry url in the name registry.

jberry · March 25, 2021, 10:48pm

I like GitHub. It's a good product and works well. But I think it's important that we as a community understand that in accepting this design we are de-facto making GitHub our registry, and making it hard for any other registries to ever compete. I would much prefer a design that had another level of indirection driven by a simple network-based name lookup service that maps package-ids into registry urls. Such a design would flatten the competitive landscape and provide more flexibility.

Yes, I know there are questions of who would run such a name registry. Perhaps GitHub would do so, behind an appropriately swifty vanity-url so that it could be moved at some point if GitHub were to implode. Having this integrated into GitHub could also solve the complexity problem, at least for users of GitHub: maybe there would be a per-project defaulted setting that would automatically populate the name-registry for the project.

SDGGiesbrecht · March 25, 2021, 11:19pm

The definition of “diacritic” is somewhat fuzzy. I searched for the technical definition of what ICU does, but was unable to find it. The European diacritics weren’t really my concern (as much as they are important), but rather all the abugidas of Southeast Asia, such as Hindi. If the definition of “diacritic” simply means a combining class other than 0, then the abugidas become completely unintelligible, because every vowel is a combining character, as well as gemination and such. As an English comparison, imagine if “bait” ≍ “bat” ≍ “batty” ≍ “beat” ≍ “beet” ≍ “bet” ≍ “Betty” ≍ “bight” ≍ “bit” ≍ “bite” ≍ “boat” ≍ “boot” ≍ “bot” ≍ “bout” ≍ “but” ≍ “butt” ≍ “byte”...

I don’t know of a concrete example of a Hindi project, but I do know of entire code bases I cannot read because they are entirely in Korean. I cannot type a single one of their identifiers. I don’t use them and I never care to. But I do know that even if Swift permitted only ASCII everywhere, those code bases would still be in Korean, just a very awkward version of it clumsily transliterated into Latin letters. It would still be hopeless for me to try to understand what their code does or how to use their library. I would need a translation shim either way. (I don’t use the Portuguese code bases I’ve come across for the same reason, even though I do recognize the individual letters.) So I say let everyone else use their language the natural way and don’t worry about it. It’s easier for them and there are no negative effects for me.

P.S. Yes, we do agree here:

NeoNacho · March 26, 2021, 2:14am

I would say the design already absolutely allows this: someone could set up a registry that essentially acts as a proxy by implementing the registry API and calling out to other registries to serve the actual data.

Since the configuration is per user, not per package, it also seems tractable to me that this could happen at any time after GH hypothetically became a common choice for the default registry.

tim1724 · March 26, 2021, 2:27am

If the proposal means code points it should probably say code points. In the context of Swift the word "character" does not usually mean code point.
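To make the difference concrete, here is a small sketch using only the Swift standard library. It counts the same string as Swift `Character`s (extended grapheme clusters), as Unicode scalars, and as UTF-8 bytes; the string itself is an arbitrary example.

```swift
// One base letter followed by 200 combining acute accents (U+0301).
let stacked = "E" + String(repeating: "\u{0301}", count: 200)

let characterCount = stacked.count              // extended grapheme clusters
let scalarCount = stacked.unicodeScalars.count  // Unicode scalars (code points)
let utf8Count = stacked.utf8.count              // bytes in UTF-8

print(characterCount, scalarCount, utf8Count)
```

A limit of 128 would admit this string if measured in `Character`s and reject it if measured in scalars or bytes.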

jberry · March 26, 2021, 2:36am

Yes, somebody could set up a registry that proxies other registries. But recognize that such a registry proxy would suffer the bidirectional bandwidth cost of the fully loaded traffic, likely making it quite expensive to run and not a good default: certainly out of the realm of a community solution unless it had a deep pocketed sponsor. The alternative I suggest of a simple name lookup service would be a lot more resource efficient.

Edit: (Perhaps my statement above is false if the proxy registry is able to simply redirect for each of the proxied calls, in which case the bandwidth is more manageable).

mattt · March 26, 2021, 3:01am

First, let me say this up front: I agree with you and the other folks on this thread who think diacritic insensitive comparisons aren't a good idea. I think the proposal should be amended to remove that before it's adopted. Thanks to you, @hisekaldma, @wtedst, @lorentey, and @taylorswift for the helpful feedback y'all provided.

One of the things I like most about Swift is its support for Unicode throughout the language. So when it came time to design an identity scheme for packages, I wanted to preserve that as much as I could. For better or for worse, ASCII-only is the norm for package ecosystems, so there's not much in the way of prior art. To that end, I'm very thankful for the resources like UAX #31, which do a great job of explaining the challenges and strategies for working in this complex domain.

Responding to your post, @SDGGiesbrecht:

I found this definition from the ICU glossary (which appears to have recently gotten revamped):

Diacritic
A modifying mark on a character. For example, the accent marks in Latin script (acute, tilde, and ogonek) and the tone marks in Thai. Synonymous with accent.

For Devanagari / Indic scripts and maybe other abugidas, my understanding is that characters are combined using ZWJs.
Thanks as always for the opportunity to geek out about Unicode.

You're absolutely correct. That should be "scope" not "name". Thanks for pointing this out!

Depending on the context, Unicode documentation may refer to a "character" as either an individual code point or an extended grapheme cluster. The former "code point / scalar" meaning is, by my reading, the one typically used in programming contexts. To wit, our proposal mentions XID_START and XID_CONTINUE character properties, which only make sense if we're talking about members in the Unicode Character Database.

My hope was that the regular expression would clear up any ambiguity. But if this remains a source of confusion, I can revisit this language in the proposal before it's adopted.

xwu · March 26, 2021, 4:13am

The amended design is certainly more in line with the core team's requested exploration, and I think it is the right step. Some initial questions/comments:

I'm glad the issue with diacritics has been thoroughly discussed. It's nice that being 8 hours late to the party means I don't have to write more on that.

I have to say that I agree with @SDGGiesbrecht: I don't see the point of case-insensitive comparisons. Moreover, the security and ergonomic considerations you name argue against rather than for case folding, which could create rather than remove the very security problem you're trying to avoid:

Consider that each scope and package name will be subject to the policies of the registries that they are hosted on, and recall that there are locale-dependent case changes (cf. Turkish dotted i). If SwiftPM additionally applies its own locale-independent case folding, then a user could specify a {scope} and package name exactly correctly, and SwiftPM could apply case changes that are different from those of the registry's, thereby requesting a malicious squatter's code from the registry instead.
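The locale dependence mentioned above can be seen with Foundation's locale-aware lowercasing. This is only a sketch of the hazard, assuming Foundation is available; it does not show what SwiftPM or any registry actually does, and exact results depend on the platform's Unicode data.

```swift
import Foundation

let name = "I"  // LATIN CAPITAL LETTER I

// Locale-independent lowercasing from the standard library.
let rootLower = name.lowercased()

// Locale-aware lowercasing under Turkish rules, via Foundation.
let turkishLower = name.lowercased(with: Locale(identifier: "tr"))

// If a client folds one way and a registry folds the other way,
// the two sides end up looking for different names.
let sidesAgree = rootLower == turkishLower
print(rootLower, turkishLower, sidesAgree)
```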

How so? Swift has never (to my knowledge) enforced any sort of maximum on package name length, nor of any variable or type. The length of any name is going to be bounded by registries to some maximum anyway, and more saliently, it will be bounded by what users will be willing to type into their package manifests.

Put another way, the question would be: Why is it important to you that every scope and every package name, regardless of where they are hosted, be subject to the same length limitation? Is there some attack vector made possible by exceedingly long package names?

Finally, Swift does not offer an NFKC normalization API; would we need to do so for this proposal to be fully implemented?

owenv · March 26, 2021, 4:52am

This is looking great! I do have some concerns about communicating best practices for migrating from git URLs to package IDs though. It seems like changing a package dependency description to use a registry ID should always be a major version bump, because otherwise it will break any packages that depend on it and don't have the same set of registries configured. I could see this becoming a pretty big issue if, say, the switch is made in a patch version of a really popular transitive dependency. Because there's no way to determine which registry the ID was intended to be fetched from, errors like this also won't be very actionable without consulting READMEs, documentation, etc. Would it be worth adding an optional parameter to the .package(...) declaration for a 'recommended' registry that wouldn't impact resolution at all, but could be recommended to the user if the process fails (e.g. error: cannot resolve package 'mona.LinkedList'; run swift package-registry set --scope ... to use the registry recommended by the author of XYZ)? I'm not sure if that's a great idea, but it might help a little.

SDGGiesbrecht · March 26, 2021, 5:36am

Minor corrections to some technical Unicode details above. Intended for @mattt and anyone interested. Likely boring and unimportant to most readers.

ZWJ is only when you want to coax out an incomplete glyph in isolation without the rest of the construct. (The ZWJ interacts as though it were an invisible vowel component.) Normal usage just strings the construct components together.

I did double‐check the UCD and I was wrong earlier about them having a combining class of 0. They use an entirely different system to determine what is a base and what is combining. Again, the presence of competing systems and definitions makes me wary of diacritic insensitivity. We cannot know if the algorithm is stable if we cannot even find documentation for it. Imagine if two packages work side‐by‐side until they don’t anymore, just because ICU adjusts the algorithm. (The other kinds of folding discussed here—NFKC and case folding—both have stability guarantees from Unicode.)

stevapple · March 26, 2021, 8:56am

I have the same worries as @owenv. Unlike the original proposal, this one completely separates git-based and registry-based package retrieval, which makes it painful to adopt because it definitely breaks backward compatibility. I think there should be some way to prefer registries while keeping git compatibility, which would guide SwiftPM users to migrate and align their practices of declaring package dependencies.

johannesweiss · March 26, 2021, 9:02am

I'm totally in favour of this proposal, seems like the right move to me.

regarding the NFKC normalisation: +1, assuming we actually allow non-ASCII
diacritic insensitivity / non-ASCII: I don't feel strongly about it but I'd personally go for a subset of ASCII (e.g. [a-z0-9-]+ or so). Allowing non-ASCII seems like uncharted territory and a bit of a minefield (both on the technical level and the cultural level). Having something as complex as this is bound to create implementation differences across the registries so I reckon only ASCII will be fully safe on all registries anyway. Complexity also usually fosters security vulnerabilities.
regarding the bias towards GitHub: I think this new version makes a hugely important step to reduce the bias towards GitHub. With this proposal you can freely move your package say from GitHub to GitLab and everything keeps working with the same fidelity. That didn't use to be the case in the first version because the registry was found at the same domain as the repo URL (which was very hard to change). I will however say that I would've appreciated it if the proposal mentioned that the authors are IIUC all associated with GitHub.
We should add a requirement for any registry that a user must be able to freely adjust the Link: <URL>; rel="canonical" at any point. So for example I think the GitHub registry must allow changing the rel="canonical" of a project to, say, GitLab or any other host.
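For reference, the ASCII subset floated above (`[a-z0-9-]+`) can be checked without any Unicode machinery. The sketch below is one possible reading of that suggestion; the function name `isValidASCIIScope` is hypothetical, and this is not the rule the proposal specifies.

```swift
// Hypothetical validator for the suggested subset [a-z0-9-]+.
/// Returns true if `scope` is non-empty and consists only of
/// lowercase ASCII letters, ASCII digits, and hyphens.
func isValidASCIIScope(_ scope: String) -> Bool {
    guard !scope.isEmpty else { return false }
    return scope.unicodeScalars.allSatisfy { scalar in
        switch scalar {
        case "a"..."z", "0"..."9", "-": return true
        default: return false
        }
    }
}

let latin = isValidASCIIScope("mona")          // plain ASCII
let cyrillic = isValidASCIIScope("\u{0410}")   // CYRILLIC CAPITAL LETTER A
let accented = isValidASCIIScope("m\u{00E5}nad")
print(latin, cyrillic, accented)
```

Under this rule the homograph and diacritic questions do not arise, at the cost of excluding every non-ASCII name.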