There are a couple problems with `::` as the separator:

- It's a little weird to introduce this syntax in the `#file` string before we introduce it in the language itself.
- It's possible to have `:` in a filename on some platforms. This means that only one component of `file-string` can be a path component, which limits what we could do with a disambiguator field in the future.
- Parsing the `::` separator in Swift code would be awkward because `String` doesn't currently have searching or splitting calls that work on more than a single character. You'd need to import Foundation to do it. (Or wait for the "matchers" proposal or something similar to develop and become part of the language.)
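To make the last point concrete, here is a minimal sketch of the Foundation route. It assumes a `#file` string of the form `Foo::a.swift`; the variable name `fields` is only for illustration.

```swift
import Foundation

// `components(separatedBy:)` comes from Foundation (bridged from NSString),
// not from the Swift standard library.
let fields = "Foo::a.swift".components(separatedBy: "::")
print(fields)  // ["Foo", "a.swift"]
```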
However, this did give me an interesting idea that also solves the same-file-name issue. Consider this:
file-string → module-name ":" partial-sha ":" file-name
partial-sha → [0...40 characters from ASCII "0"..."9" and "a"..."f"]
`partial-sha` is a prefix of the hex representation of the `#filePath` string's SHA-1 checksum. (Why SHA-1 instead of something more secure? It's not being used for security here and most programming environments, including llvm/Support, already implement it.) The compiler adds only enough hexits to distinguish between two otherwise identical strings. So if you don't have any overlaps in filenames, you end up with `::`, but if you do, there are one or more characters between them.
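Because every separator here is a single character, the string can be taken apart with the standard library alone. A rough sketch, assuming the three-field grammar above; `parseFileString` is a hypothetical helper name, not something the compiler or standard library provides.

```swift
/// Hypothetical helper: splits a `#file` string of the form
/// `module-name:partial-sha:file-name` into its three fields.
/// Returns nil if the string has fewer than two ":" separators.
func parseFileString(_ fileString: String)
    -> (module: Substring, partialSHA: Substring, fileName: Substring)? {
    // Split at most twice, so any further ":" stays inside the file name.
    // Empty fields are kept because the SHA fragment may be empty.
    let fields = fileString.split(separator: ":", maxSplits: 2,
                                  omittingEmptySubsequences: false)
    guard fields.count == 3 else { return nil }
    return (module: fields[0], partialSHA: fields[1], fileName: fields[2])
}

if let parsed = parseFileString("Foo:8:c.swift") {
    print(parsed.module, parsed.partialSHA, parsed.fileName)  // Foo 8 c.swift
}
```

Capping the split at two keeps the file name as the only field that may itself contain `:`, which is the "only one path component" restriction mentioned earlier.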
Swift implementation (quick-and-dirty, requires CryptoKit)
```swift
import Foundation
import CryptoKit

extension Digest {
    /// The digest as a lowercase hex string, two characters per byte.
    var hex: String {
        let str = String(lazy.flatMap { String(format: "%02x", $0) })
        assert(str.count == Self.byteCount * 2)
        return str
    }
}

/// - Returns: A dictionary with `#file` string keys and `#filePath` string values.
func fileStrings(forFilePaths filePaths: Set<String>, inModule module: String) -> [String: String] {
    struct FileStringCandidate {
        init(path: String) {
            self.path = path
            self.pathSHA = Insecure.SHA1.hash(data: path.data(using: .utf8)!).hex
        }

        let path: String
        let pathSHA: String

        var name: String { URL(fileURLWithPath: path).lastPathComponent }

        func partialPathSHA(withLength length: Int) -> Substring {
            assert(length <= pathSHA.count, "could not unique #file strings even with full SHA")
            return pathSHA.prefix(length)
        }

        func fileStringFragment(withSHALength length: Int) -> String {
            "\(partialPathSHA(withLength: length)):\(name)"
        }
    }

    /// The #file-to-#filePath map for all fully computed strings.
    var finished: [String: String] = [:]

    /// Current length of new entries into the working dictionary.
    var workingPartialSHALength = 0

    /// Holds candidates that we are not sure we've finished working on.
    var working: [String: [FileStringCandidate]] = [:]

    /// Adds candidates to `working` under keys with `workingPartialSHALength` SHA characters.
    func scheduleUnderNewKeys(_ candidates: [FileStringCandidate]) {
        // A SHA-1 digest is 20 bytes, i.e. 40 hex characters.
        assert(workingPartialSHALength <= 40)
        working.merge(
            candidates.map {
                ($0.fileStringFragment(withSHALength: workingPartialSHALength), [$0])
            },
            uniquingKeysWith: +
        )
    }

    scheduleUnderNewKeys(filePaths.map(FileStringCandidate.init(path:)))

    while !working.isEmpty {
        workingPartialSHALength += 1

        // The loop iterates over a snapshot of `working`, so mutating it here is safe.
        for (fileStringFragment, candidates) in working {
            // In both cases, we remove the old entry.
            working[fileStringFragment] = nil

            if candidates.count == 1 {
                // Move this to the finished list.
                finished["\(module):\(fileStringFragment)"] = candidates.first!.path
            }
            else {
                // Re-add candidates with a longer SHA.
                scheduleUnderNewKeys(candidates)
            }
        }
    }

    assert(finished.count == filePaths.count, "lost some #filePath strings along the way")
    return finished
}

print(fileStrings(forFilePaths: ["/src/a.swift", "/src/b.swift", "/src/c.swift", "/tmp/c.swift"], inModule: "Foo"))
```
For instance, module `Foo` with these four files would get:

| `#filePath` string | `#file` string | SHA-1(`#filePath` string) |
| --- | --- | --- |
| `/src/a.swift` | `Foo::a.swift` | `b8a19407a17b4607665958d1f5f47b4c80445df2` |
| `/src/b.swift` | `Foo::b.swift` | `18b2dcdcb10c918b4318b27096df8c285abf4d29` |
| `/src/c.swift` | `Foo:c:c.swift` | `cfc12f3697ad6853e86d3517fade1dff1dc371ac` |
| `/tmp/c.swift` | `Foo:8:c.swift` | `8c80e2fed5328e154bd8e14ae3d1be1c85335abc` |
If the first character of `/tmp/c.swift`'s SHA had also been "c", the two files would have ended up with the `#file` strings `Foo:cf:c.swift` and `Foo:cc:c.swift`, respectively. In theory, the compiler could go all the way to 40 characters, although that's vanishingly unlikely to ever happen.
This design means that, as long as you know the exact paths that were used in the compilation, you can unambiguously match `#file` strings to those paths: just winnow the list down to the ones with that filename, then check if the SHA fragment (which might be empty) is a prefix of the path's SHA. However, it does mean that `#file` strings can still change spuriously based on the paths passed to the compiler (although only for same-name files and `#sourceLocation`s), and the SHA fragment is useless if you don't know the exact paths used at compile time. I'm not sure if we can resolve this somehow, particularly without making matching more complex.
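The matching procedure described above could look roughly like this. It is only a sketch: `matchPath(forFileString:amongPathSHAs:)` and `knownSHAs` are hypothetical names, and the SHA-1 values are copied from the table rather than computed, so the snippet needs nothing beyond the standard library.

```swift
/// Hypothetical helper: finds the one path whose file name and SHA-1 prefix
/// both match the given `#file` string. `pathSHAs` maps each `#filePath`
/// string used in the compilation to its full hex SHA-1.
func matchPath(forFileString fileString: String,
               amongPathSHAs pathSHAs: [String: String]) -> String? {
    // Fields are module-name, partial-sha, file-name; the SHA may be empty.
    let fields = fileString.split(separator: ":", maxSplits: 2,
                                  omittingEmptySubsequences: false)
    guard fields.count == 3 else { return nil }
    let (partialSHA, fileName) = (fields[1], fields[2])

    // Winnow to paths with that file name, then check the SHA prefix.
    // `starts(with:)` is true for an empty fragment.
    let matches = pathSHAs.filter { entry in
        entry.key.split(separator: "/").last == fileName
            && entry.value.starts(with: partialSHA)
    }
    return matches.count == 1 ? matches.keys.first : nil
}

// Paths and SHA-1 values taken from the table above.
let knownSHAs = [
    "/src/a.swift": "b8a19407a17b4607665958d1f5f47b4c80445df2",
    "/src/b.swift": "18b2dcdcb10c918b4318b27096df8c285abf4d29",
    "/src/c.swift": "cfc12f3697ad6853e86d3517fade1dff1dc371ac",
    "/tmp/c.swift": "8c80e2fed5328e154bd8e14ae3d1be1c85335abc",
]

print(matchPath(forFileString: "Foo:8:c.swift", amongPathSHAs: knownSHAs) ?? "no unique match")
```

A string such as `Foo::c.swift` would match both `c.swift` paths, so this sketch returns `nil` for it rather than guessing.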