SE-0274: Concise Magic File Names

  • What is your evaluation of the proposal?

+0 for the #file change.
+1 for adding #filePath
-1 for #filePath being the variant 1 instead of variant 3.

  • Is the problem being addressed significant enough to warrant a change to Swift?

Yes.

  • Does this proposal fit well with the feel and direction of Swift?

Yes.

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

As proposed: I think it is about as good as other implementations.
With #filePath being (optionally) relative to a source root declared on compiler invocation (e.g. by SwiftPM): Better.
I'd like to (additionally) see #filename (just the name) and #modulename (just the module).
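
To make the source-root idea concrete, here is a rough sketch of how a path could be reported relative to a declared root. The helper name `relativePath(of:toSourceRoot:)` and the sample paths are made up for illustration; nothing here is part of the proposal or of SwiftPM.

```swift
/// Hypothetical helper: strips `root` from the front of `fullPath`.
/// Falls back to the unmodified path when the file lies outside the root.
func relativePath(of fullPath: String, toSourceRoot root: String) -> String {
  // Normalize the root so "/Users/me/Pkg" and "/Users/me/Pkg/" behave the same.
  let rootWithSlash = root.hasSuffix("/") ? root : root + "/"
  guard fullPath.hasPrefix(rootWithSlash) else { return fullPath }
  return String(fullPath.dropFirst(rootWithSlash.count))
}

let shortened = relativePath(of: "/Users/me/Pkg/Sources/Foo/a.swift", toSourceRoot: "/Users/me/Pkg")
print(shortened)  // Sources/Foo/a.swift
```

The relative form no longer contains the user name, which is the part of the full path that raises the privacy concern.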

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Quick reading, followed original discussion.


Oh, I don't think I quite understood before, but that's a really interesting idea! So if I just say "swift foo.swift," it could map those strings back to full paths when executing the file. Same for the "swift run" and "swift test" commands of SwiftPM. Finally, I think we might want something like

swift exec someExecutable <rest of command line that built someExecutable>

for executing things built outside of SwiftPM.

That depends on your concerns - but imho there's at least a big difference:
I have no idea what harm somebody could do with the names of some files, but as the proposal mentions, the full path can easily reveal the name of the user who built the binary.
This information might already be a small help for an attack (afaik there are still people trying brute-force attacks with common logins like "root" or "admin").

Additionally, it might be possible to gain information that can be used for social hacking... of course, I guess in most setups, it would only make IT laugh when somebody claims to be a colleague of Mr. Jenkins ;-), but who knows?

file-string → module-name "/"  file-name

@beccadax The / separator is a good idea, but another option is the :: separator from apple/swift#28834.

Fatal error: <message>: file MagicFile::0274-magic-file.swift, line 3

There are a few problems with :: as the separator:

  1. It's a little weird to introduce this syntax in the #file string before we introduce it in the language itself.

  2. It's possible to have : in a filename on some platforms. This means that only one component of file-string can be a path component, which limits what we could do with a disambiguator field in the future.

  3. Parsing the :: separator in Swift code would be awkward because String doesn't currently have searching or splitting calls that work on more than a single character. You'd need to import Foundation to do it. (Or wait for the "matchers" proposal or something similar to develop and become part of the language.)
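
To illustrate the third point, here is a small sketch comparing the two separators. The sample strings are just the example names used earlier in this thread, and the Foundation call is one possible way to split on "::", not the only one (newer toolchains may offer other options).

```swift
import Foundation

// Example strings in the two formats under discussion.
let slashForm = "MagicFile/0274-magic-file.swift"
let colonForm = "MagicFile::0274-magic-file.swift"

// Standard library only: split once on a single Character.
let slash: Character = "/"
let slashParts = slashForm.split(separator: slash, maxSplits: 1).map(String.init)

// Multi-character separator: this sketch leans on Foundation.
let colonParts = colonForm.components(separatedBy: "::")

print(slashParts)
print(colonParts)
```

Both calls yield the module name followed by the file name. Note that `components(separatedBy:)` splits at every occurrence, so a file name that itself contains "::" would need extra handling.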


However, this did give me an interesting idea that also solves the same-file-name issue. Consider this:

file-string → module-name ":" partial-sha ":" file-name

partial-sha → [0...40 characters from ASCII "0"..."9" and "a"..."f"]

partial-sha is a prefix of the hex representation of the #filePath string's SHA-1 checksum. (Why SHA-1 instead of something more secure? It's not being used for security here and most programming environments, including llvm/Support, already implement it.) The compiler adds only enough hexits to distinguish between two otherwise identical strings. So if you don't have any overlaps in filenames, you end up with ::, but if you do, there are one or more characters between them.

Swift implementation (quick-and-dirty, requires CryptoKit)
import Foundation
import CryptoKit

extension Digest {
  var hex: String {
    let str = String(lazy.flatMap { String(format: "%02x", $0) })
    assert(str.count == Self.byteCount * 2)
    return str
  }
}

/// - Returns: A dictionary with `#file` string keys and `#filePath` string values.
func fileStrings(forFilePaths filePaths: Set<String>, inModule module: String) -> [String: String] {
  struct FileStringCandidate {
    init(path: String) {
      self.path = path
      self.pathSHA = Insecure.SHA1.hash(data: path.data(using: .utf8)!).hex
    }

    let path: String
    let pathSHA: String

    var name: String { URL(fileURLWithPath: path).lastPathComponent }

    func partialPathSHA(withLength length: Int) -> Substring {
      assert(length <= pathSHA.count, "could not unique #file strings even with full SHA")
      return pathSHA.prefix(length)
    }

    func fileStringFragment(withSHALength length: Int) -> String {
      "\(partialPathSHA(withLength: length)):\(name)"
    }
  }

  /// The #file-to-#filePath map for all fully computed strings.
  var finished: [String: String] = [:]

  /// Current length of new entries into the working dictionary.
  var workingPartialSHALength = 0

  /// Holds candidates that we are not sure we've finished working on.
  var working: [String: [FileStringCandidate]] = [:]

  /// Adds candidates to `working` under keys with `workingPartialSHALength` SHA characters.
  func scheduleUnderNewKeys(_ candidates: [FileStringCandidate]) {
    assert(workingPartialSHALength <= 40)  // a SHA-1 digest is 40 hex digits long
    working.merge(
      candidates.map {
        ($0.fileStringFragment(withSHALength: workingPartialSHALength), [$0])
      },
      uniquingKeysWith: +
    )
  }

  scheduleUnderNewKeys(filePaths.map(FileStringCandidate.init(path:)))

  while !working.isEmpty {
    workingPartialSHALength += 1

    for (fileStringFragment, candidates) in working {
      // In both cases, we remove the old entry.
      working[fileStringFragment] = nil

      if candidates.count == 1 {
        // Move this to the finished list.
        finished["\(module):\(fileStringFragment)"] = candidates.first!.path
      }
      else {
        // Re-add candidates with a longer SHA.
        scheduleUnderNewKeys(candidates)
      }
    }
  }

  assert(finished.count == filePaths.count, "lost some #filePath strings along the way")

  return finished
}

print(fileStrings(forFilePaths: ["/src/a.swift", "/src/b.swift", "/src/c.swift", "/tmp/c.swift"], inModule: "Foo"))

For instance, module Foo with these four files would get:

#filePath string    #file string     SHA-1(#filePath string)
/src/a.swift        Foo::a.swift     b8a19407a17b4607665958d1f5f47b4c80445df2
/src/b.swift        Foo::b.swift     18b2dcdcb10c918b4318b27096df8c285abf4d29
/src/c.swift        Foo:c:c.swift    cfc12f3697ad6853e86d3517fade1dff1dc371ac
/tmp/c.swift        Foo:8:c.swift    8c80e2fed5328e154bd8e14ae3d1be1c85335abc

If the first character of /tmp/c.swift's SHA had also been "c", the two files would have ended up with the #file strings Foo:cf:c.swift and Foo:cc:c.swift, respectively. In theory, the compiler could go all the way to 40 characters, although that's vanishingly unlikely to ever happen.

This design means that, as long as you know the exact paths that were used in the compilation, you can unambiguously match #file strings to those paths—just winnow the list down to the ones with that filename, then check if the SHA fragment (which might be empty) is a prefix of the path's SHA. However, it does mean that #file strings can still change spuriously based on the paths passed to the compiler—it just will only happen for same-name files (and #sourceLocations)—and the SHA fragment is useless if you don't know the exact paths used at compile time. I'm not sure if we can resolve this somehow, particularly without making matching more complex.
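
The matching step described here could look roughly like the sketch below. The function name `matchingPaths(forFileString:among:hexDigest:)` is hypothetical, and the hashing is passed in as a closure so the sketch does not depend on any particular crypto library; the digests in the usage example are copied from the table above rather than computed.

```swift
import Foundation

/// Returns the candidate paths consistent with a `module:partial-sha:file-name` string.
/// `hexDigest` is a caller-supplied function (an assumption of this sketch) that
/// returns the lowercase hex SHA-1 of a path string.
func matchingPaths(
  forFileString fileString: String,
  among candidatePaths: [String],
  hexDigest: (String) -> String
) -> [String] {
  let colon: Character = ":"
  // Split at the first two colons only; the file name itself may contain ":".
  // Empty fields are kept because the SHA fragment is allowed to be empty.
  let fields = fileString.split(separator: colon, maxSplits: 2, omittingEmptySubsequences: false)
  guard fields.count == 3 else { return [] }
  let partialSHA = fields[1]
  let fileName = String(fields[2])

  return candidatePaths.filter { path in
    // Winnow by file name first, then check the SHA fragment as a prefix.
    URL(fileURLWithPath: path).lastPathComponent == fileName
      && hexDigest(path).starts(with: partialSHA)
  }
}

// Digests copied from the table above, standing in for a real SHA-1 call.
let knownDigests = [
  "/src/a.swift": "b8a19407a17b4607665958d1f5f47b4c80445df2",
  "/src/b.swift": "18b2dcdcb10c918b4318b27096df8c285abf4d29",
  "/src/c.swift": "cfc12f3697ad6853e86d3517fade1dff1dc371ac",
  "/tmp/c.swift": "8c80e2fed5328e154bd8e14ae3d1be1c85335abc",
]
let compiledPaths = ["/src/a.swift", "/src/b.swift", "/src/c.swift", "/tmp/c.swift"]
let lookup: (String) -> String = { knownDigests[$0] ?? "" }

let match = matchingPaths(forFileString: "Foo:8:c.swift", among: compiledPaths, hexDigest: lookup)
print(match)  // ["/tmp/c.swift"]
```

An empty SHA fragment, as in Foo::a.swift, matches any digest, so the lookup then relies on the file name alone.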


That would not only make identifiers much less predictable (I cannot calculate SHA-1 in my head); an identifier could also change when a completely unrelated file is added to the codebase, couldn't it?
I would prefer to always include a prefix with a fixed minimum length. In most cases that would avoid the need to look at the full checksum at all, and the checksum has to be calculated anyway.

That's true, on macOS you can create a foo:bar file in the Terminal, although it displays as foo/bar in the Finder. I think this is because : was the path separator for HFS (on classic Mac OS).

Could the same partial-sha also be used as the discriminator of private and fileprivate symbols? The examples in test/SILGen/mangling_private.swift appear to include a full 128-bit hash of the filename.

I like the idea, but it also works with the /-separated grammar of your current proposal, and without the other :: problems you mentioned earlier.

This could be 7 characters, like on GitHub. (e.g. Foo/8c80e2f/c.swift)
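
As a rough sketch of that fixed-length variant, assuming the path's SHA-1 hex digest is already available (the function name and its parameters are made up for illustration):

```swift
/// Hypothetical helper: builds a `module/short-sha/file-name` string with a
/// fixed-length SHA prefix (7 hex digits by default, as in the example above).
func fileString(module: String, pathSHA: String, fileName: String, shaLength: Int = 7) -> String {
  "\(module)/\(pathSHA.prefix(shaLength))/\(fileName)"
}

// Digest taken from the /tmp/c.swift row of the earlier table.
let example = fileString(
  module: "Foo",
  pathSHA: "8c80e2fed5328e154bd8e14ae3d1be1c85335abc",
  fileName: "c.swift"
)
print(example)  // Foo/8c80e2f/c.swift
```

With a fixed length the string for one file no longer depends on which other files are in the module, at the cost of a small chance that two prefixes collide.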


UPDATE: Could you hash the file contents rather than the file path?

  • It would help with the "reproducible builds" motivation.
  • It could apply to all #file strings, regardless of name.
  • It could be compatible with the git hash-object command.
import Foundation
import CryptoKit

// <https://git-scm.com/book/en/v2/Git-Internals-Git-Objects#_object_storage>
// `fileURL` is a placeholder for the source file being hashed.
let fileURL = URL(fileURLWithPath: "0274-magic-file.swift")
let fileSize = try! fileURL.resourceValues(forKeys: [.fileSizeKey]).fileSize!
let fileText = try! String(contentsOf: fileURL, encoding: .utf8)
let blobText = "blob \(fileSize)\0\(fileText)"
let blobData = blobText.data(using: .utf8)!
let blobHash = Insecure.SHA1.hash(data: blobData)

Then for a "MagicFile/62b124b/0274-magic-file.swift" example:

git show 62b124b                # Show source code
git cat-file -p 62b124b         # Show source code
git log --find-object=62b124b   # Show all commits
git describe --always 62b124b   # Show first <commit>:<path>
git rev-parse --verify 62b124b  # Show full ID or candidates

I think it could. I don’t know if we’d want to do that right away, because it would affect @_private import.

That could work for real files, but not for #sourceLocation directives.


Review Manager Update

Thanks to everyone for their feedback so far.

The core team has discussed the review thread and concluded that the proposal should be accepted broadly as proposed:

  • #file will be altered to report only the module and filename.
  • #filePath will be introduced to provide the full path, for use cases that previously relied on it.
  • While the team acknowledges that this does require some existing workflows to adapt to the new scheme, the binary size and privacy concerns over implicitly embedding the full path were significant enough to warrant this.
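
As an illustration of how the two literals sit side by side, here is a minimal sketch. The function `describeCallSite` is hypothetical, and what `#file` actually prints depends on the compiler version and language mode in use, so no output is shown.

```swift
/// Hypothetical helper that captures both literals at the call site
/// through default arguments.
func describeCallSite(
  _ message: String,
  file: String = #file,
  filePath: String = #filePath,
  line: Int = #line
) -> String {
  "\(message): file \(file), line \(line) (full path: \(filePath))"
}

let description = describeCallSite("example")
print(description)
```

Code that needs to open or locate the source file on disk would read `filePath`; code that only labels a diagnostic can use `file`.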

One area that the team felt still required addressing was the question of the format for the module+filename combination. The proposal explicitly left that unspecified, leaving open the possibility of changing the format at a later date. The core team prefers an approach of determining the format explicitly, allowing future tooling to rely on it to separate out the module name and file name.

As a result, the core team have asked the proposal author to amend the proposal to make the format of #file explicit, and a re-review focused specifically on this revision will be conducted.

Thanks to everyone for participating in this review.

Ben Cohen
Review Manager
