[Re-review] SE-0274: Concise Magic File Names

Ben_Cohen · January 28, 2020, 5:46pm

The re-review of SE-0274 — Concise Magic File Names begins now and runs through February 4, 2020.

The original review thread concluded with the core team accepting the proposal in principal: #file will be updated to include only the file and module names, with #filePath introduced for use cases that require the full path to the file.

However, during the review it was proposed that the exact format of #file be specified, in order to allow for future tooling to rely on that format to determine the file and module name from the string.

The proposal has been amended to explicitly include this format, and this re-review is focused on that aspect of the proposal. A diff between the previous version of the proposal and the current one can be found here.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to the review manager (via email or direct message in the Swift forums).

What goes into a review of a proposal?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift.

When reviewing a proposal, here are some questions to consider:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Thanks,
Ben Cohen
Review Manager

benrimmington · January 28, 2020, 7:49pm

file-string → module-name "/" file-name
file-string → module-name "/" disambiguator "/" file-name

If the Swift compiler allows a module to contain two source files with the same file-name component, for example:

"/Users/brent/Desktop/0274-magic-file.swift"
"/Volumes/SailorSwift/0274-magic-file.swift"

then each disambiguator could include just enough path components (starting from the end of the path) to disambiguate the #file strings:

"MagicFile" + "/" + "Desktop"     + "/" + "0274-magic-file.swift"
"MagicFile" + "/" + "SailorSwift" + "/" + "0274-magic-file.swift"

If the Core Team are concerned about privacy, then the disambiguator could be a SHA-1 checksum prefix of the path components, as suggested by Brent in the previous review.

"MagicFile" + "/" + "532c67f" + "/" + "0274-magic-file.swift"
"MagicFile" + "/" + "ae4e52f" + "/" + "0274-magic-file.swift"

gribozavr · January 30, 2020, 10:47am

What is your evaluation of the proposal?

+0.5. TL;DR:

I don't quite like having #filePath available at all, because even one usage of #filePath in your dependencies is sufficient to leak information about the build machine, and to make builds non-reproducible. There is not enough motivation in the proposal to understand why it is being added, so I can't propose a better alternative.
I think the "#moduleName + #fileName" alternative is better because it provides the information that people want directly (people said that they will want to parse out the module name), and this option will allow better deduplication of strings.

Long analysis follows.

The issues with information leakage, and unnecessary differences between binaries built in different ways are only addressed partially. To fully address them, we should not be adding #filePath, or at least, allow to instruct the compiler the compiler to make #filePath stable, for example, make it equal to #file. This issue is somewhat mitigated by the fact that #filePath is not forced to be an absolute path (it passes through whatever was provided on the command line), so as long as the build system is careful to not use absolute paths either, it should be possible to build code reproducibly. However, such an arrangement is fragile.

The availability of #filePath itself is an issue -- if someone in your dependency graph uses it, your builds can become non-reproducible and can leak information. Does everyone audit all of the open source code that you depend on? I think not. So I don't think compatibility would be a good reason to add #filePath, because we don't want users to just replace #file with #filePath to get the previous behavior. Generally, the proposal does not provide enough motivation for #filePath (it says "For those applications which still need a full path, we will provide a new magic identifier, #filePath" -- I would prefer to see the actual use cases listed).

The issue with bloated binaries is also not fully addressed in this proposal, because each of the #file strings will contain the module name. While all #file strings coming from one file can be deduplicated, module names across #file strings coming from multiple files can't be deduplicated, so we would be leaving some code size improvements on the table. We could enable such deduplication by adopting the #fileName + #moduleName alternative.

What the proposal says about #fileName is true -- it is not a useful string by itself, it should always be combined with #moduleName to be non-ambiguous. However, that can (and should) be done at the API level, to allow the compiler and the linker to deduplicate strings.

ABI breakage for existing clients, like fatalError(_:file:line:) is unfortunate, but there are ways to deal with it (add a new entry point, and keep the old one for old clients).

The proposal argues that file: String = "\(#moduleName)/\(#fileName)" gives us the caller's location, and uses it as one of justifications for not introducing #fileName. However, this behavior looks like a bug. Compare that default argument expression with file: String = #file, which gives us the callee's location. What we see is that the default expressions involving #file are sometimes evaluated at the caller's location, and sometimes at the callee's location -- looks like a bug to me. Once that bug is fixed (so that all default argument expressions are evaluated with the caller's source location), clients that don't want to break the ABI would be able to combine the strings in the default argument if they like.

Another reason to add separate #moduleName and #fileName is to address the use case of people who said that they want to parse #file. Sure, we can document how to parse it correctly in a forward-compatible way -- but we could instead provide the information that people want directly.

The only advantage of introducing a combined #file instead of #moduleName + #fileName that I see, is having a standardized way to compose and communicate concise file paths. If we introduce #moduleName + #fileName, different users could end up combining them differently, making it more difficult for tooling to parse them out of logs and other places. However, as long as the format of the pseudo-filename that can be recognized by tooling is clearly documented, I think most users will adopt it.

If we must introduce #file, then I fully agree with @tkremenek's point about having to specify the format of the string. If we don't formally specify the format of the string, while providing a stable format in practice, users will still come to expect the behavior that is implemented. It is a trivial application of Hyrum's Law. I'm not even sure that it would be possible to exercise the the flexibility to add a disambiguator in future without breaking code. I think some code will be broken no matter what, because it will not expect a disambiguator.

Is the problem being addressed significant enough to warrant a change to Swift?

Yes. Of the reasons listed by the proposal, binary size due to embedding long paths is a serious issue for big libraries and applications, which will become a hurdle for larger-scale adoption of Swift. Reproducible builds are also a necessity for many large-scale deployment, which often have technical reasons to prefer reproducible, hermetic builds (simpler mental model, easier to debug build system issues, enables caching of artifacts), and also non-technical reasons, like internal and external auditability and compliance requirements.

Does this proposal fit well with the feel and direction of Swift?

Yes, magic identifiers so far have been the way to express these sort of features.

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

C and C++ have a magic filename macro, and it has been causing issues that the proposal mentions. To mitigate them, Clang allows to redefine the macro on the command line (-D__FILE__="example.cpp"). That requires the user to do their own plumbing in the build system, which is not nice, and few people care to do it. This proposal is different in that it tries to make the defaults good.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I am deeply familiar with the issues that the proposal lists as motivation.

John_McCall · February 5, 2020, 8:23pm

Sorry, I'm just now catching up with this thread.

If people want #moduleName and #fileName on their own right, I think that's a reasonable request. However, the deduplication / binary-size argument for separating these strings is almost certainly misguided. File name strings are likely to be unique within a module, so the expected binary-size benefit here is just that each of those string literal will not include a MyModule/ prefix, which for an M-byte module name is at most an M+1-byte win per file in the module. This is not an insignificant improvement, but it has an associated cost: every call site must materialize an extra argument by loading the address of a string literal into a register. On x86-64, leaq OFFSET(%rip), REG is a 7-byte instruction, and on ARM64, adrp/add is an 8-byte instruction sequence; we can assume it'll generally run in that range across architectures, although it can be larger or smaller depending on the code model. That is a per-call-site cost, not a per-file cost. If there are N paired uses of #module and #fileName on average in files that use them at all, then on x86-64 the module name would need to be 7N bytes long before separating the strings showed a benefit over #file. I haven't tried to measure the value of N across all known code, but I think it's very unlikely to be low enough to make this work.

gribozavr · February 5, 2020, 9:17pm

Just to provide extra context, when we build Swift with Bazel, the module names are generated automatically from Bazel package names, and can be quite long (in the codebases that I work in, most module names have at least 40 characters, and 100+ character module names are common). Such long package names are unfortunate, but are absolutely necessary in large codebases (otherwise every team would make a module called "Util", which would conflict with each other when linked into one app).

Bazel package names, for the most part, repeat the directory name of the file. Therefore, for Swift codebases built with Bazel, this proposal is changing #file from path/to/my/file.swift to path_to_my/file.swift, which does not help much with code size.

I didn't think about the code size cost, and it is indeed pretty unfortunate. These instructions also have a chance to pollute the instruction cache (as compared to the cold string data that will likely stay out of the cache for the whole duration of program execution). With that, I agree that going with the #moduleName + #fileName option right now would create extra costs.

If we were to holistically fix the issue with code and data size, I think a better fix would probably combine all source location information (module, file, line) into one structure in read-only data, and deduplicate strings across them. Functions that need to pass around source location information will pass pointers to those structs. Doing so requires to materialize only one pointer (compared to a filename + line number today, or with this proposal). However, introducing that mechanism would be a different proposal.

Therefore, I would like to revise my comment about this proposal not solving the binary size issue -- it solves it as well as changing magic constants can solve it. Better solutions are possible, but would require a bigger change.

John_McCall · February 5, 2020, 9:36pm

I think you need to split the line number out of that structure, or else you're still eating a substantial call-site-specific cost — unless you see a lot of emissions with the exact same source location, which I think is unlikely. But it's true that you could optimize (#moduleName, #fileName) into a single structure with two relative addresses, and that would cut the overhead from M+1 bytes per file down to a constant 8 bytes.

I understand the Util problem, but there's almost certainly a less redundant namespace that Bazel could use as a module prefix than the full path to the package. Do you really expect users to write imports of 100-character module names?

gribozavr · February 5, 2020, 9:51pm

Such a name is not easily available right now. Deriving it would require global analysis over the codebase, which is always difficult, and can create spooky action at a distance when some short name suddenly becomes non-unique due to an introduction of a new module somewhere in the source tree.

IDEs write them for users. It also makes an intuitive model for large codebases: if you are looking at a Swift file whose APIs you want to use, you can mechanically transform the file path into a module name that you should import without the need to consult the build system configuration.

John_McCall · February 5, 2020, 9:58pm

Well, I'll take it as a given here, but you will certainly get a better tooling experience if you can just namespace it by one of the org, team, or supported project instead of apparently all three.

gribozavr · February 5, 2020, 10:24pm

I'm not sure why we will get a better tooling experience. The names will be more convenient for humans (in some use cases), but that's about all I can think of. We could get a higher tooling maintenance cost if we would need to have an up-to-date table that maps a module name back to the path that contains its sources, so that we could easily find them when the Swift module file is not available. With package-based module names such mapping is trivial.

Also, the paths don't necessarily reflect the organizational structure as clearly as that. Requiring a human to decide on an org-, team-, or project-specific prefix could work if humans were perfect, but they are not. Project names can be reused, they can collide across distant teams, and project/team boundaries can be fuzzy so it can be difficult to find a decider and make them do the right thing for all stakeholders.

However, the code has to go into some directory, the engineer has to decide on the directory in any case. Turning the name of that directory into a module name just works, requires no human judgement, and does not allow for human error. It is not the most aesthetically pleasing solution, but it is a pragmatic one for big codebases.

allevato · February 5, 2020, 11:15pm

And Bazel is actually the lucky case here, because all of our code is in a monorepo where we have unique paths from which to derive module names automatically, and we can easily do large-scale changes across multiple projects if something is broken or needs to migrate.

I'm more worried about the scalability of Swift (in terms of projects involving large number of modules) in the multi-repo case, like SwiftPM users. But it's easy to imagine someone putting together a project P that depends on two totally unrelated libraries, A and B. Maybe A starts out with a module somewhere in its dependency graph named Utils. Then B releases a critical update that happens to add its own dependency to a completely different module that's also named Utils. Project P simply cannot do this upgrade because both Utils modules can't be in the dependency graph. And unlike a monorepo, the author can't simply send a patch to all the projects involved and fix the conflict.

The solution so far seems to be to prefix your module names if you're building a library package (see TSCBasic, TSCUtility, etc. in swift-tools-support-core), but that's not scalable and relies on humans to do the right thing. I think implementation-only imports would also help by cutting off how far the modules need to propagate, but that also relies on humans to use the right imports, and it won't work if those modules can't be implementation-only imported (if they need to be used in upstream exported APIs).

I don't know what a good general solution is right now, but as SwiftPM usage increases and people start composing projects from more finely-grained modules (vs. the traditional Xcode model of "everything is in your app target plus maybe one or two frameworks"), more users are going to hit the issue that we've already predicted and tried to avoid in Bazel, by taking the human factor out of the equation. Our solution is ugly, but it works for the most part.

Ben_Cohen · February 26, 2020, 9:30pm

Review Conclusion

The second review of this proposal has ended and the proposal is accepted.