[Proposal] SOAR-0002 Improved naming of content types

georgebarnett · August 7, 2023, 9:42am

The Swift OpenAPI Generator improvement proposal SOAR-0002: Improved naming of content types is now In Review.

The review period will run until 14th August. Please reply to this thread or on the pull request with any feedback.

Honza_Dvorsky · August 7, 2023, 11:01am

Here's a direct link for easier reading of the proposal: https://github.com/czechboy0/swift-openapi-generator/blob/hd-soar-0002-content-type-naming/Sources/swift-openapi-generator/Documentation.docc/Proposals/SOAR-0002.md

georgebarnett · August 7, 2023, 4:30pm

These specific values were not chosen arbitrarily, instead I wrote a script that collected and processed about 1200 OpenAPI documents from the wild, and aggregated usage statistics. These content types, in this order, were the top used content types from those documents.

Could you share more info about the aggregate statistics? What percentage of all content-types that you discovered did the 7 chosen ones account for? What percentage did the 7th and 8th most used account for?

Having "short" names also feels slightly at odds with the principle to "Faithfully represent the OpenAPI document".

Honza_Dvorsky · August 7, 2023, 6:10pm

Certainly. Out of the 1192 OpenAPI documents sampled, here is how many included each content type at least once. Note that a content type is only counted once per OpenAPI document, so even a doc with 10 occurrences of a content type still counts as 1.

application/json: 1073 (90%)
application/x-www-form-urlencoded: 96 (8%)
multipart/form-data: 76 (6%)
text/plain: 75 (6%)
*/*: 60 (5%)
application/xml: 49 (4%)
application/octet-stream: 39 (3%)

That's where I drew the line. The next content types were quickly dropping off, with types like text/html, application/yaml, text/csv, image/png, application/pdf, image/jpeg, all around 10 each. Then the long tail continued for a total of 492 content types.
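To make the counting rule concrete, here is a minimal sketch of per-document counting. It is not the script used for the numbers above: `documentFrequency(of:)` and the `sample` input are hypothetical, and each document is assumed to be already reduced to a list of its content type strings.

```swift
/// Counts, for each content type, how many documents mention it at least once.
/// `documents` is a hypothetical input: one array of content type strings per OpenAPI document.
func documentFrequency(of documents: [[String]]) -> [String: Int] {
    var counts: [String: Int] = [:]
    for contentTypes in documents {
        // A Set removes duplicates, so 10 occurrences in one document still count as 1.
        for contentType in Set(contentTypes) {
            counts[contentType, default: 0] += 1
        }
    }
    return counts
}

// Made-up sample data, for illustration only.
let sample: [[String]] = [
    ["application/json", "application/json", "text/plain"],
    ["application/json"],
    ["multipart/form-data", "application/json"],
]
let counts = documentFrequency(of: sample)

// Sort by descending count, breaking ties by name for a stable order.
let ranked = counts.sorted { lhs, rhs in
    lhs.value != rhs.value ? lhs.value > rhs.value : lhs.key < rhs.key
}
for (contentType, count) in ranked {
    // Integer percentage of documents that mention this content type.
    let percent = count * 100 / sample.count
    print("\(contentType): \(count) (\(percent)%)")
}
```

Counting documents rather than occurrences keeps one very large document from dominating the ranking.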

Let me talk about each of the content types and why they deserve a special treatment:

application/json - the most important content type in REST services, with 90% clearly the most popular one by far
application/x-www-form-urlencoded, multipart/form-data, and application/xml - structured content types that the OpenAPI specification documents enough that, while we don't today, we could generate type-safe types for in the future, just like we do for JSON today
text/plain - commonly used to send unstructured text data, like logs, so deserves to be deserialized into the native Swift container: Swift.String
*/* - also explicitly called out in the OpenAPI specification, however we're still figuring out if/how we'd generate special code for it, but it's clearly popular; and its long name is not very nice (see below)
application/octet-stream - also specially called out in the OpenAPI specification as the default raw bytes content type; its long name also isn't super beginner-friendly, "octet stream" is the first term developers use when talking about raw data

That's in contrast to the content types I left under the line, like text/html, image/png, and application/pdf, which the OpenAPI specification doesn't document any structure for, so it's unlikely we'll ever try to introspect; instead, we'll continue to pass the raw bytes to the adopter's code to handle however they like. So it seemed like a natural point to draw the line, as we only do the work of coming up with short names for 7/492 = ~1.4% of content types, but they still cover the vast majority of use cases.

That's a fair interpretation, but let me compare and contrast the two options we're deciding between here.

Content type	Proposed short name	Long name
`application/json`	`json`	`application_sol_json`
`application/x-www-form-urlencoded`	`form`	`application_sol_x_hyphen_www_hyphen_form_hyphen_urlencoded`
`multipart/form-data`	`multipart`	`multipart_sol_form_hyphen_data`
`text/plain`	`text`	`text_sol_plain`
`*/*`	`any`	`_ast__sol__ast_`
`application/xml`	`xml`	`application_sol_xml`
`application/octet-stream`	`binary`	`application_sol_octet_hyphen_stream`

I think the short names on the left actually represent the intent of the OpenAPI author better than the names on the right. In a world where content types were a closed set, we could come up with a short name for every content type, but since that's not the case, we have to draw a line somewhere between the frequently used content types that deserve pretty names, and all other content types, which we stringify using a scheme that results in as few conflicts as possible while still being readable (even if not pretty), recently updated in SOAR-0001.
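As a rough sketch of the split approach (not the generator's actual code), the lookup could be a small table of short names with a fallback to the stringified long name. The names `shortNames`, `replacements`, `longName(for:)`, and `identifier(for:)` are hypothetical, and the substitutions are only the three that can be inferred from the "Long name" column in the table above; the real scheme from SOAR-0001 handles more characters.

```swift
// Hypothetical table holding the seven short names proposed in v1.
let shortNames: [String: String] = [
    "application/json": "json",
    "application/x-www-form-urlencoded": "form",
    "multipart/form-data": "multipart",
    "text/plain": "text",
    "*/*": "any",
    "application/xml": "xml",
    "application/octet-stream": "binary",
]

// Assumption: only the substitutions visible in the table above.
let replacements: [Character: String] = [
    "/": "_sol_",
    "-": "_hyphen_",
    "*": "_ast_",
]

/// Builds the long (fallback) name by replacing each special character.
func longName(for contentType: String) -> String {
    contentType.map { replacements[$0] ?? String($0) }.joined()
}

/// Uses the short name when one exists, otherwise falls back to the long name.
func identifier(for contentType: String) -> String {
    shortNames[contentType] ?? longName(for: contentType)
}

print(identifier(for: "application/json"))
print(identifier(for: "text/html"))
```

With this shape, moving the line between "pretty" and "generic" names only means adding or removing table entries; the fallback path stays the same.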

If we don't use this split approach, we would either have to write short names for all content types, which we can't (as mentioned, it's an open set), or write short names for none of them, which is more consistent, but I feel it'd be sacrificing readability and user ergonomics for little benefit, since the extra complexity of the short names is negligible. The split logic seems justified by the fact that 90%+ of adopters will see these identifiers all over their request/response bodies, and have to spell them by hand when sending requests and unwrapping received responses.

I do wish we could apply a simple rule one way or another, but alas I don't think that'd serve the adopters best.

That said, I'm more than happy to discuss where the line is drawn, and what the exact spelling of the short names should be. I came up with these very unscientifically, by just trying to pick the name I most commonly hear developers use when referring to these content types.

I'm particularly not very happy with form and any, so I would like some ideas on those. The rest I think are okay, but again, feedback on any of them is very welcome.

beaumont · August 7, 2023, 7:04pm

I'm inclined to agree: these short names lose quite a bit of resolution.

I don't think that anyone would argue that application_sol_json is a good name for application/json. However, looking at RFC 2045 - Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies, it doesn't appear we need to be as defensive in the name mapping as we are for OpenAPI identifiers, and I would venture that we could probably shoot for something like application_json pretty safely; that is, we could replace the slash that separates the type from the subtype with an underscore (cf. _sol_).

That said, I think there's some precedent for some shorter names: https://github.com/vapor/core/blob/main/Sources/Core/MediaType.swift.

Looking at the link above, they seemed content with any and form-data.

My biggest gripe is the use of multipart for multipart/form-data, which seems to stand out as being the only one "squatting" on a top-level type for its name. What would we call other multipart/<subtype> types?

While I'm still not sold on short names, I think that they need to attempt to convey the specificity of the full content-type.

Similar with text, I'd be happier with plaintext.

Honza_Dvorsky · August 7, 2023, 8:36pm

Yeah, those sound better: formData, plainText, urlEncodedForm. That way, none of them squat their top level type. I'll update to those names in a revision of my proposal - thanks!

Honza_Dvorsky · August 7, 2023, 9:04pm

I'll also investigate getting rid of the _sol_ part by safing each component of the MIME type separately, and concatenating them with an underscore. Also to come in the next revision.

georgebarnett · August 8, 2023, 7:36am

Interesting – thanks for sharing your data!

I understand your point entirely: providing short names for all media types is impossible. As you also noted, the fallback names are not the most readable.

The concern I have is the transition between the two. While 3% sounds pretty low, out of 1000 OpenAPI documents it's still 30 documents, and each document may use these types multiple times. It's also a bit surprising to me that types I would consider to be common, like text/html, aren't covered by the short names.

I think we should:

Draw the line further down so that more media types have short names
Evaluate whether the long names can be made simpler (although this might not be necessary if we special case more media types)

Honza_Dvorsky · August 8, 2023, 7:52am

Sounds good. Here's the continued list with proposed short names, suggestions for better names are welcome:

Content type	Number of occurrences	Proposed short name
`text/json`	21	None, I think folks are using this by mistake, and mean `application/json`?
`text/html`	19	`html`
`application/yaml`	14	`yaml`
`text/csv`	14	`csv`
`text/xml`	14	None, same as `text/json`?
`image/png`	13	`png`
`application/pdf`	11	`pdf`
`image/jpeg`	10	`jpeg`

Under 10, the long tail of various custom application/vnd.* types starts and quickly approaches a single occurrence each. So I think stopping at 10 occurrences is reasonable.

I'm investigating the idea @beaumont proposed above, where we actually safe the type and subtype separately, and concatenate them with an underscore, so for foo/bar we'd end up with foo_bar instead of foo_sol_bar, which I hope is enough of an improvement to work for the long tail.
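A minimal sketch of that idea, under stated assumptions: `safeComponent(_:)` and `fallbackName(for:)` are hypothetical names, and the per-component "safing" here simply reuses the `_hyphen_` and `_ast_` substitutions seen in the v1 long names. The actual rules in the revised proposal may differ.

```swift
// Assumption: each component is made safe with the same substitutions as the v1 long names.
let componentReplacements: [Character: String] = [
    "-": "_hyphen_",
    "*": "_ast_",
]

/// Makes a single MIME component (type or subtype) safe for use in an identifier.
func safeComponent(_ component: Substring) -> String {
    component.map { componentReplacements[$0] ?? String($0) }.joined()
}

/// Safes the type and subtype separately, then joins them with an underscore.
func fallbackName(for contentType: String) -> String {
    // Split "type/subtype" on the first slash only, keeping empty parts.
    let parts = contentType.split(
        separator: "/",
        maxSplits: 1,
        omittingEmptySubsequences: false
    )
    return parts.map(safeComponent).joined(separator: "_")
}

print(fallbackName(for: "foo/bar"))
```

Because the slash is consumed by the split, it never needs a `_sol_` substitution of its own.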

Honza_Dvorsky · August 8, 2023, 9:19am

Hi everyone,

thanks for the feedback so far. I just pushed v2 of the proposal.

Diff from v1 to v2 is in this commit: SOAR-0002: Improved naming of content types by czechboy0 · Pull Request #170 · apple/swift-openapi-generator · GitHub

It aims to address the two main points of feedback:

which content types have a short name, and what their spelling is
how the long/generic/fallback names are computed

You can find the current rendered version of the proposal here, and it contains a versions section describing what changes were made. Also, inline in the proposal, I highlighted what changed from v1 to v2.

cc @beaumont @georgebarnett

georgebarnett · August 8, 2023, 10:01am

One minor naming note:

application/x-www-form-urlencoded maps to urlEncodedForm
multipart/form-data maps to formData

Should urlEncodedForm be urlEncodedFormData for consistency with formData?

Honza_Dvorsky · August 8, 2023, 1:18pm

I'm not sure, as the word "data" doesn't appear anywhere in application/x-www-form-urlencoded.

Looking around more, I think a lot of people use the term "multipart" and know what it means. I wonder if we could borrow the short name multipartForm from Hummingbird for multipart/form-data?

That way, we'd have:

application/x-www-form-urlencoded -> urlEncodedForm
multipart/form-data -> multipartForm

That achieves the consistency that I think you were looking for, as these are just "two kinds of forms", and IMO is closer to the terms used for these content types day to day. WDYT?

georgebarnett · August 8, 2023, 1:56pm

Works for me.

Honza_Dvorsky · August 8, 2023, 2:22pm

Thanks @georgebarnett.

Ok, here's v3 of the proposal, diff from v2: SOAR-0002: Improved naming of content types by czechboy0 · Pull Request #170 · apple/swift-openapi-generator · GitHub

Latest rendered version: https://github.com/czechboy0/swift-openapi-generator/blob/hd-soar-0002-content-type-naming/Sources/swift-openapi-generator/Documentation.docc/Proposals/SOAR-0002.md

beaumont · August 9, 2023, 10:13am

Thanks for making the amendments so far @Honza_Dvorsky!

While the idealist in me doesn't love having to pick an arbitrary cutoff for short names, nor a solution that doesn't generalise to all names, the proposal as-is adds value so +1 from me.

georgebarnett · August 14, 2023, 9:33am

The review period has now ended. Feedback broadly fell into two areas:

How the names of content types are generated.
The number of content types with short names.

Feedback has converged and v3 of the proposal is accepted. SOAR-0002 is now Ready for Implementation.

Honza_Dvorsky · August 14, 2023, 12:45pm

Thanks @georgebarnett.

Ok, the change landed in main now behind the multipleContentTypes feature flag.