Extend SwiftPM `PackageDescription` to introduce metadata

mattt · June 24, 2020, 4:21pm

I agree that we should have a conventional way to represent package information. However, I think it'd be a mistake to invest in a new, Swift-specific solution.

Schema.org provides a comprehensive standard for representing project metadata through its SoftwareSourceCode data type.

As discussed in our pitch for a Swift package registry specification:

Swift Package Registry Service

A server MAY include additional fields in its response. It is RECOMMENDED that package metadata be represented in JSON-LD according to a structured data standard. For example, this response using the Schema.org SoftwareSourceCode vocabulary:

{
  "@context": ["http://schema.org/"],
  "@type": "SoftwareSourceCode",
  "identifier": "@mona/LinkedList",
  "name": "LinkedList",
  "description": "One thing links to another.",
  "keywords": ["data-structure", "collection"],
  "version": "1.1.1",
  "codeRepository": "https://github.com/mona/LinkedList",
  "license": "https://www.apache.org/licenses/LICENSE-2.0",
  "readme": "https://github.com/mona/LinkedList/blob/master/README.md",
  "issueTracker": "https://github.com/mona/LinkedList/issues",
  "programmingLanguage": {
    "@type": "ComputerLanguage",
    "name": "Swift",
    "url": "https://swift.org"
  },
  "author": {
      "@type": "Person",
      "@id": "https://github.com/mona",
      "givenName": "Mona",
      "middleName": "Lisa",
      "familyName": "Octocat"
  }
}

By representing project metadata as linked data, it can be understood by machines and integrated with other linked data. Also, the specific choice of encoding is less important; the most obvious choice here would be JSON-LD, but the same information could be represented as RDF.
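As a rough illustration of the "understood by machines" point, here is a minimal sketch of reading a trimmed copy of the response above with Foundation's `JSONDecoder`. The `SoftwareSourceCode` struct is a hypothetical model covering only a few fields, and it treats the document as plain JSON rather than running a real JSON-LD processor, so it assumes `keywords` is always an array.

```swift
import Foundation

// Hypothetical model covering only a subset of the example's fields.
struct SoftwareSourceCode: Decodable {
    struct Language: Decodable {
        let name: String
        let url: URL?
    }

    let type: String
    let identifier: String
    let name: String
    let keywords: [String]
    let version: String
    let codeRepository: URL
    let programmingLanguage: Language

    enum CodingKeys: String, CodingKey {
        case type = "@type" // JSON-LD keywords start with "@"
        case identifier, name, keywords, version, codeRepository, programmingLanguage
    }
}

// Trimmed copy of the response shown above.
let json = """
{
  "@context": ["http://schema.org/"],
  "@type": "SoftwareSourceCode",
  "identifier": "@mona/LinkedList",
  "name": "LinkedList",
  "keywords": ["data-structure", "collection"],
  "version": "1.1.1",
  "codeRepository": "https://github.com/mona/LinkedList",
  "programmingLanguage": { "@type": "ComputerLanguage", "name": "Swift", "url": "https://swift.org" }
}
"""

// Keys that the model does not declare (such as "@context") are ignored.
let package = try JSONDecoder().decode(SoftwareSourceCode.self, from: Data(json.utf8))
print(package.name, package.version, package.keywords)
```

A consumer that needs the full linked-data semantics (term expansion via `@context`, single values versus arrays) would want a JSON-LD library instead of a fixed struct like this.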

We could spend a lot of time coming up with our own solution here, but I don't think that we'd end up with anything more capable than the Schema.org definition.

Jon_Shier · June 24, 2020, 4:28pm

To be clear Mattt, are you suggesting an external file that supports this schema, supporting this schema's items from within Package.swift, or just suggesting that, whatever the solution, we follow this schema?

mattt · June 24, 2020, 4:36pm

My primary argument here is that metadata should follow an existing linked data standard, like SoftwareSourceCode, rather than inventing something new. Given the extensible nature of linked data, representing this in Swift would be difficult, so my preference for an external file follows from that first point (though if we really wanted to put this in Package.swift, I guess we could encode JSON-LD in a string literal).

Jack_Newcombe · June 25, 2020, 7:58am

+1 for following a standard and keeping it as a separate metadata file.

One big advantage I can see for a separate metadata file is that it can be shared with other package managers like CocoaPods and Carthage and not have to be maintained in multiple places.

I would go as far as to say that some of the existing metadata could be moved out of Package.swift and into this new metadata file - including the package name - and PackageDescription could be extended to allow either the existing distinct metadata fields in the constructor for spm-only repos (as it currently works), or a metadata flag/filename/etc. for repos supporting multiple PMs.

GetSwifty · June 26, 2020, 4:01pm

I've been thinking through this and I first want to make sure I understand the intended purpose.

By definition metadata isn't (or shouldn't be) necessary to have a working/compiling package and should be entirely optional. A README usually contains any additional info in terms of a human-readable format. Since that format is supported as-is by most git-hosts as well as Xcode, it seems pointless to duplicate that information.

I think that narrows down the purpose to automated parsing and indexing. Is there another use case I should be considering?

My main concern around putting this info in the manifest is it would chain any metadata changes to SPM/Swift evolution and versioning. If an indexer gets popular they will likely want/need to add additional fields without going through this process. I also would expect them to want additional metadata specific to their use case that wouldn't belong in the manifest. This could be supported by an "other" option that allows raw entry, but that kind of defeats the purpose of a manifest standard.

In that context a separate file would probably be better and preferably in a format that allows non-standard fields. However at that point I'm unsure what the utility is for having an official standard. Especially before anything needing the metadata has taken off.

Would another option be to, rather than define a standard, support linking metadata files?

metadata: [
    .readme(path: "README.md"),
    .license(path: "LICENSE"),
    .other(path: "SPIMetadata.yml"),
]

This has some advantages:

- Concisely referring to additional info without mucking up the manifest
- Remove some of the "magic" a lot of these files currently have
- Avoids any localization issues
- Avoids format wars
- Allows standards to emerge and evolve naturally
- Allows private or non-standard uses

It would also codify existing emergent standards while allowing more flexibility:

metadata: [
    .readme("instructions/SPM_README.md"), // Allows path flexibility and creating an SPM specific README
    .license(), // Path could default to "LICENSE"
]

And if standards are established (or existing standards adopted), it could easily be extended to indicate a standard without requiring that standard to be defined within SPM.

metadata: [
    .license(type: .mit2), // path defaults to "LICENSE"
    .other(path: "metadata/SPIMetadata.yml", type: .spi),
    .other(path: "metadata/ssc.json", type: .softwareSourceCode),
]
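To make the shape of that idea concrete, here is a minimal sketch of how such references could be modelled as plain Swift values. `MetadataReference`, `MetadataKind`, and the `format` label are all hypothetical; nothing like this exists in PackageDescription today, and the default paths are assumptions taken from the comments in the snippets above.

```swift
// Hypothetical types; nothing like this exists in PackageDescription today.
enum MetadataKind: Equatable {
    case readme
    case license
    case other(format: String?)
}

struct MetadataReference: Equatable {
    let kind: MetadataKind
    let path: String

    // Default paths mirror common conventions, so most packages could omit them.
    static func readme(path: String = "README.md") -> MetadataReference {
        MetadataReference(kind: .readme, path: path)
    }

    static func license(path: String = "LICENSE") -> MetadataReference {
        MetadataReference(kind: .license, path: path)
    }

    // `format` is a free-form hint for tools; SwiftPM itself would not need to interpret it.
    static func other(path: String, format: String? = nil) -> MetadataReference {
        MetadataReference(kind: .other(format: format), path: path)
    }
}

let metadata: [MetadataReference] = [
    .readme(path: "instructions/SPM_README.md"),
    .license(),
    .other(path: "metadata/ssc.json", format: "softwareSourceCode"),
]

for reference in metadata {
    print(reference.kind, reference.path)
}
```

Because the manifest only records where each file lives, an indexer can define or change the file's contents without a Swift evolution proposal.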

daveverwer · June 27, 2020, 5:46pm

I agree that if there’s a standard that works, we should prefer to use it but I must admit I’m struggling a bit with understanding the SoftwareSourceCode spec. It’s the first time I’ve come across JSON-LD, so please bear with me.

Looking at the spec, I can see that author is a Person or an Organization. My first thought is that doesn’t cover a lot of scenarios as open source code rarely has just one author, and is rarely an organisation.

However, it strikes me that author might possibly be an array of Person or Organization data, because looking at the spec for keywords it is defined as Text and yet the example above includes an array of text. Can anything in the spec be expressed as an array? The spec is very unclear on that issue if that’s the case, but if it’s not I don’t see how keywords is working.

What I’d love to do is get the schema into something I can play with and write a bit of data and see what validates, and what doesn’t. Is there a way I can easily get something like that? I found the JSON-LD Playground but it didn’t seem to validate anything. I took their example of a Person and deliberately changed the field names, and it seemed fine with it. I’m almost certainly doing it wrong, but any help would be appreciated.

mattt · June 27, 2020, 6:54pm

Correct. Here's the relevant JSON-LD documentation from W3C:

6.11 Sets and Lists

This section is non-normative.

A JSON-LD author can express multiple values in a compact way by using arrays. Since graphs do not describe ordering for links between nodes, arrays in JSON-LD do not provide an ordering of the contained elements by default. This is exactly the opposite from regular JSON arrays, which are ordered by default. For example, consider the following simple document:

EXAMPLE 42: Multiple values with no inherent order

{ ... "@id": "http://example.org/people#joebob", "nick": [ "joe", "bob", "JB" ], ... }

The example shown above would result in the following data being generated, each relating the node to an individual value, with no inherent order:

Subject	Property	Value
http://example.org/people#joebob	http://xmlns.com/foaf/0.1/nick	joe
http://example.org/people#joebob	http://xmlns.com/foaf/0.1/nick	bob
http://example.org/people#joebob	http://xmlns.com/foaf/0.1/nick	JB

I think that last table is a good illustration of how linked data works. Each (non-@-prefixed) line in a JSON-LD object represents a fact, comprising a subject, predicate, and object. When you provide multiple values for, say, a project's author field, that expands into a separate fact for each value in the array.
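The expansion described here can be sketched in a few lines of Swift. This is a simplified illustration, not a JSON-LD processor: the `Triple` type and `triples(from:)` function are hypothetical, property names are left as the short terms found in the object rather than being expanded to full IRIs through `@context`, and nested objects are not handled.

```swift
// Simplified model of one fact: subject, property, value.
struct Triple: Equatable {
    let subject: String
    let property: String
    let value: String
}

/// Expands one JSON object into facts, emitting one triple per array element.
/// Keys starting with "@" are treated as JSON-LD keywords and skipped.
func triples(from node: [String: Any]) -> [Triple] {
    guard let subject = node["@id"] as? String else { return [] }
    var result: [Triple] = []
    for key in node.keys.sorted() where !key.hasPrefix("@") {
        let raw = node[key]!
        // A single value is treated like a one-element array.
        let values = (raw as? [Any]) ?? [raw]
        for value in values {
            result.append(Triple(subject: subject, property: key, value: "\(value)"))
        }
    }
    return result
}

let node: [String: Any] = [
    "@id": "http://example.org/people#joebob",
    "nick": ["joe", "bob", "JB"],
]

for fact in triples(from: node) {
    print(fact.subject, fact.property, fact.value)
}
```

The same function would turn an `author` array with two entries into two separate facts about the same project, which is why any property can carry multiple values.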

Apologies, it looks like my pasted example had a couple of syntax errors (extra trailing comma and missing trailing slash in the schema.org url). After making those corrections, it validates in the playground.

daveverwer · June 27, 2020, 10:30pm

Thank you Mattt, that's very helpful.

I also found a better validator, which checks key names as well as basic structure: the Schema Markup Testing Tool (Google Search Central).

I'm genuinely really torn on the issue of whether following this schema is the best way to approach this problem.

Fundamentally it seems like a good idea, and I do not particularly want to make a new standard, but the huge number of keys and flexibility in this spec does have drawbacks.

It’s easy to look at a spec like this and say that it’s the right decision because it’s a standard, and it supports every possible piece of metadata that might ever be needed. However, we should not underestimate both the problems package authors will have getting this data correct and the problems tools (like the Swift Package Index) will have picking which of the fields to pay attention to.

For example, do we use name when we report on a Person listed as the author, or some combination of givenName, familyName, additionalName, or middleName? All are valid. The schema is so huge, and so much more than is needed that it feels like it’s going to make both filling it in, and sensibly parsing it really difficult.
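To show the kind of decision every consuming tool would have to make, here is a minimal sketch of one possible fallback policy. The `Person` struct is a hypothetical subset of the fields mentioned above, and the ordering (prefer `name`, otherwise join whichever parts are present) is an assumption; another tool could reasonably choose differently, which is exactly the concern.

```swift
// Hypothetical subset of the Person fields discussed above.
struct Person {
    var name: String? = nil
    var givenName: String? = nil
    var additionalName: String? = nil
    var familyName: String? = nil
}

/// One possible policy: prefer `name`, otherwise join the name parts that are present.
func displayName(for person: Person) -> String? {
    if let name = person.name, !name.isEmpty {
        return name
    }
    let parts = [person.givenName, person.additionalName, person.familyName]
        .compactMap { $0 }
        .filter { !$0.isEmpty }
    return parts.isEmpty ? nil : parts.joined(separator: " ")
}

let author = Person(givenName: "Mona", familyName: "Octocat")
print(displayName(for: author) ?? "unknown")
```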

There is also the significant issue of JSON being quite difficult for humans to read and write. The lack of support for comments, the poor readability/writability of multiline strings. At least initially, this file is going to be manually created by humans, and the scope for getting it wrong is huge. If it’s too hard to fill in, and the benefits are not clear, package authors won’t bother filling it in.

If we do go down the path of using this schema, I think the approach I'll take on the SPI site is to make a documentation page stating explicitly which subset of the spec we'll pay attention to, and give examples. There’s a chance that a subset of this spec may become well adopted, and at that point if people want to add more data into their metadata files, that’s OK.

I look at this schema, and then think back to what I was imagining this would be - a fairly simple YAML file with just a few, focused keys. It just feels so much more user friendly and so much less error prone.

I’m genuinely torn on the issue.

mattpolzin · June 28, 2020, 3:47pm

These strike me as really good points. Making a standard broadly applicable/extensible and making a standard easily accessible do feel like competing goals sometimes. I think that’s true whether it’s the standard to describe all software source code or just the standard to describe metadata specifically relevant to Swift packages.

I think it’s worth the exercise we have started to go through of (re-)inventing the design plans for a wheel before tearing off and building a factory to build our new wheel or telling everyone to use an existing wheel design. Apologies for even trying to make that analogy work.

My point is, if in the process of designing our own spec we notice that we need to leave room for lots of parts of it to be extensible or alternatively parts of it feel awkwardly restrictive then maybe we need a complex pre-existing standard. On the other hand, maybe the problem of describing Swift Package Metadata simply is not as complex as the problem of describing an arbitrary body of source code was determined to be by the authors of the SoftwareSourceCode specification. On the other other hand, maybe our design starts to have enough in common with an existing standard that it would be silly to use our own new standard to no real benefit.

Max_Desiatov · June 28, 2020, 4:46pm

I hope that both the "new standard" and SoftwareSourceCode are not mutually exclusive as long as we make sure that one can unambiguously transform an arbitrary package manifest to some structured data compatible with what schema.org declares. That is, the latter would be a low-level representation of the former. People write their human-readable manifest declarations, while dev tools convert those into JSON-LD where needed. E.g. SPI (or any other dev tool or index for that matter) could generate this structured data from manifest declarations to make its pages indexable in a better way by search engines and voice assistants.
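A minimal sketch of that conversion step, assuming a small hand-written format: `SimpleMetadata` and its field names are hypothetical, and the mapping onto SoftwareSourceCode properties is one plausible choice rather than an agreed one.

```swift
import Foundation

// Hypothetical "simple" metadata a package author might write by hand.
struct SimpleMetadata {
    let name: String
    let summary: String
    let keywords: [String]
    let repository: String
}

/// Maps the simple fields onto Schema.org SoftwareSourceCode property names.
func jsonLD(from metadata: SimpleMetadata) -> [String: Any] {
    [
        "@context": "http://schema.org/",
        "@type": "SoftwareSourceCode",
        "name": metadata.name,
        "description": metadata.summary,
        "keywords": metadata.keywords,
        "codeRepository": metadata.repository,
        "programmingLanguage": ["@type": "ComputerLanguage", "name": "Swift"],
    ]
}

let simple = SimpleMetadata(
    name: "LinkedList",
    summary: "One thing links to another.",
    keywords: ["data-structure", "collection"],
    repository: "https://github.com/mona/LinkedList"
)

let document = jsonLD(from: simple)
let data = try JSONSerialization.data(
    withJSONObject: document,
    options: [.prettyPrinted, .sortedKeys]
)
print(String(decoding: data, as: UTF8.self))
```

With this split, authors only ever touch the small format, and the JSON-LD output is generated, so it cannot drift out of sync or pick up hand-editing mistakes.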

finestructure · June 28, 2020, 6:56pm

I feel exactly the same, @mattpolzin: On the one hand I think it's important to look beyond the immediate need and schema.org clearly has a lot to offer. However, when trying to apply it to the current situation it seems like quite a bit of work even when you're coming at it with best intentions.

I fear that users who'd benefit from filling this out won't be as invested and just see it as a burden.

I suppose there could be ways to make it easier to edit. But introducing tooling around something that on the other hand could be as simple as a text file in any editor also feels weird.

What sort of illustrates the point - and I don't mean this to come across as snarky - is that even Mattt had validation errors in his initial sample. That makes me wonder: how many people will get this right on their first try so that we can parse it in the Swift Package Index?

daveverwer · June 28, 2020, 7:36pm

I'd completely agree with you here @Max_Desiatov in that I'd fully support the SPI, or the package registries (as proposed by GitHub and @mattt) taking the simpler format and turning it into a formal JSON-LD file.

That seems like a great solution to me. A simple metadata file, probably in YAML, with good comments (for the reasons discussed here), that package authors complete and that tools then turn into a more structured format.

@mattt does that fit well with what you saw for the package registry spec?

mattt · June 29, 2020, 12:49pm

The initial package registry proposal doesn't spend much time discussing the use of JSON-LD, but it's something I'd like to develop further for the next iteration.

I think this is fundamentally a skill gap and tooling issue. As a community, we aren't familiar with the semantic web or its tech stack. And without an appreciation of the value of encoding information in a knowledge graph, it can feel like overkill compared to a list of key-value pairs.

There are a few different ways to close this gap:

- Intermediate representation, as @Max_Desiatov mentioned
- Tooling, such as:
  - An interactive wizard (similar to npm init)
  - A validator
  - A Swift library or Playground
- GUI, such as a web form
- Documentation

Of these, I think an intermediate representation has a weaker cost/benefit analysis than some of the alternatives, but yes, it could certainly work. If you do go that route, I'd recommend looking at JSON Schema or CUE for examples of how to define and validate this structure.

The question is whether the complexity of something like SoftwareSourceCode is accidental or essential. A few folks have called out the standard as being complex, which is a fair point. However, if that complexity is essential, then you'll find that a solution that starts off as simple will become more complex over time as it fills in the missing pieces.

Validation is something you'll need to do no matter how you encode project information. Better to have an explicit schema and errors than implicit parsing behavior and silent failure. The benefit of plugging into JSON-LD is that it encodes strong, semantic guarantees about the structure of information and comes with existing tooling.
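As a rough sketch of "explicit schema and errors" versus silent failure, the following checks a parsed JSON object against a tiny rule set and reports every problem it finds. The rule set is hypothetical: which fields are required, and the choice to flag unknown keys (useful for catching typos, though linked data itself permits extra properties), are assumptions made for illustration.

```swift
import Foundation

// Hypothetical rules; which fields are required is an assumption for illustration.
enum ValidationIssue: Equatable {
    case missingField(String)
    case wrongType(field: String, expected: String)
    case unknownField(String)
}

let requiredStrings = ["name", "version", "codeRepository"]
let knownFields: Set<String> = [
    "@context", "@type", "name", "version", "codeRepository", "keywords", "description",
]

/// Returns every issue found instead of stopping at the first or ignoring bad input.
func validate(_ object: [String: Any]) -> [ValidationIssue] {
    var issues: [ValidationIssue] = []
    for field in requiredStrings {
        guard let value = object[field] else {
            issues.append(.missingField(field))
            continue
        }
        if !(value is String) {
            issues.append(.wrongType(field: field, expected: "String"))
        }
    }
    // Flagging unknown keys surfaces typos such as "versoin".
    for field in object.keys.sorted() where !knownFields.contains(field) {
        issues.append(.unknownField(field))
    }
    return issues
}

let input = Data(#"{"name": "LinkedList", "versoin": "1.1.1", "codeRepository": 42}"#.utf8)
let parsed = try JSONSerialization.jsonObject(with: input)
let object = parsed as? [String: Any] ?? [:]
print(validate(object))
```

A parser that simply read `object["version"] as? String` would see `nil` here and carry on, which is the implicit, silent behavior the post warns against.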

mattpolzin · June 29, 2020, 2:23pm

Structural validation is great, JSON Schema is relatively easy to write, and tools that apply it are readily available. But I can’t shake the feeling that we (as people trying to document something about Swift packages) are fighting a strange battle: we want to use something other than Swift code to represent metadata because using Swift introduces unwarranted dependencies, but using something other than Swift makes us want to introduce several layers of structural and relational guarantees on top of JSON anyway. Aren’t we finding ways to add things to JSON that Swift has out of the box? Isn’t the advantage of JSON over Swift exactly the fact that we trade safety for accessibility?

Karl · June 29, 2020, 2:50pm

It doesn’t matter. The goal shouldn’t be to make a political point about which language is the best - the goal is to make the information as widely available and easy to interpret as possible.

I really hope that Swift doesn’t do its own thing just for the sake of it. AFAICT nobody can articulate a reason why JSON-LD is bad or in some way does not meet our requirements: it’s just ugly and they don’t like it. At the same time, it’s curious to watch those people bemoan how hard it is for users, while they themselves are instantly able to find support resources and tooling on the web.

mattpolzin · June 29, 2020, 2:53pm

I was trying to make a practical point, not a political point.

mattpolzin · June 29, 2020, 3:06pm

Admittedly, I was momentarily forgetting about how useful it is for non-Swift based tools or websites to be able to work with this metadata because I was a bit too focused on the fact that Swift is clearly available to those writing the metadata and it’s available to the Swift Package Index. Although, since it’s not hard to dump a Swift structure to JSON, maybe it still isn’t crazy to think that we would write the metadata as a Swift structure.

mattt · June 29, 2020, 3:30pm

It's important to distinguish JSON and other data interchange formats from Swift, which is a programming language.

JSON encodes data. Ignoring any implementation differences among JSON parsers, the data you encode is the data you decode every time. It's static, structured data at rest.

You can encode information in Swift code, but decoding that information requires a Swift compiler. Nearly all systems can decode JSON in-process, but fork–exec-ing swiftc is a nontrivial runtime dependency for production systems.

But even setting that aside, there's a fundamental problem of not being able to guarantee constant evaluation of a Swift file. For example, consider this hypothetical Package.swift file proposed by @cukr in the registry proposal thread:

By the same token, you could imagine a malicious package adding long sleep() statements or performing file system operations.
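To illustrate the general concern (this is a made-up stand-in, not the example from the registry thread), here is a minimal sketch of manifest-like logic whose result can differ between evaluations. The `declaredDependencies(using:)` function and the second URL are hypothetical.

```swift
// Hypothetical stand-in for a manifest's logic; not the example from the registry thread.
func declaredDependencies<G: RandomNumberGenerator>(using generator: inout G) -> [String] {
    var dependencies = ["https://github.com/mona/LinkedList"]
    // Any runtime condition (randomness, time, environment) can change the result.
    if Bool.random(using: &generator) {
        dependencies.append("https://example.com/surprise/Dependency")
    }
    return dependencies
}

var generator = SystemRandomNumberGenerator()

// Evaluating the same "manifest" several times may yield different dependency counts.
let evaluations = (0..<5).map { _ in declaredDependencies(using: &generator).count }
print(evaluations)
```

Static data such as JSON cannot behave this way: reading the same file twice always yields the same dependency list.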

@daveverwer could decide to use Swift as a sort of intermediate representation for JSON-LD, but he'd have to weigh the potential benefit of editor support and semantic type-safety with these operational and security considerations.

Max_Desiatov · June 29, 2020, 4:42pm

Exactly this, that's the reason why I think that selecting Swift as a source language for package manifests was a mistake. It may seem to be powerful enough for some use cases, but just try to parse it in any environment where Swift itself is not installed and you'll understand the pain. It also doesn't have to be as malicious as sleep() or random(). Swift allows recursion, thus nothing prevents you from introducing an infinite loop in your Package.swift by accident, which is going to break package resolution for all packages that depend on it, and there's no way to statically diagnose it.

I hope we don't end up with this problem when trying to describe metadata. If there's a need to use something more powerful than plain JSON or YAML to describe this data, one could use languages specifically designed for that, such as Dhall, which in addition to not allowing recursion, allowing comments, and also supporting both JSON and YAML as end representations, is also statically typed. I realize though that the mainstream opinion is to have "something popular and simple", but I hope that something like Dhall could be considered for similar use cases in the future.

jayton · July 2, 2020, 11:19am

Just so that this is addressed: JSON-LD is bad because JSON doesn’t support comments and is thus inappropriate for hand-maintained files. Regardless of what tooling is available, if JSON-LD is the canonical format, it will be hand-maintained in many projects.

That said, I don’t intend to argue that any other option is clearly less bad than JSON-LD.
