I'm interested in learning how Swift implements its lack of header files, specifically as it relates to type checking. Are there certain files that I could study to gain a better understanding of the data structures and algorithms used to support this?
The closest analogy to C-style header files in Swift is the .swiftmodule file. It's a binary file (stored in LLVM's bitstream container format, the same container used for LLVM bitcode) that records the declarations (types, function signatures, etc.) for a module. Then, when you import that module from another module that you're compiling, swiftc loads that .swiftmodule file into memory so that it can typecheck your usage of the decls in that module. The object files/static archives/dylibs for those modules don't need to be available until link time, just as with C.
The .swiftmodule file also contains some additional data, like embedded SIL for @inlinable functions. By having the code for inlinable functions available as SIL, the compiler is able to take that SIL and drop it in where the function is being called in a different module and then further optimize it, more or less as C++ compilers do when a function's body is defined inline in a header.
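To make that concrete, here is a minimal sketch of what a library module's source might contain. The function names are hypothetical, chosen only for illustration, and the comments describe the default behavior as I understand it rather than a guarantee for every build configuration.

```swift
// Hypothetical contents of a library module; the names are illustrative only.

// The body of an @inlinable function is serialized (as SIL) into the
// .swiftmodule, so the optimizer can inline it into a client module.
@inlinable
public func clamp(_ value: Int, to limit: Int) -> Int {
    return min(value, limit)
}

// Without @inlinable, clients normally see only the declaration and call
// the function as an opaque symbol resolved at link time.
public func describe(_ value: Int) -> String {
    return "value: \(value)"
}
```

A client that imports this module type-checks calls to both functions from the serialized declarations alone; only clamp additionally offers its body to the client's optimizer.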
I'm sure there's other stuff I'm glossing over, but those are the parts I'm most familiar with.
Code-wise, a good place to start is in swift/lib/Serialization, where much of the logic to serialize/parse .swiftmodule files is implemented.
If you've built the toolchain from source, the llvm-bcanalyzer tool is great for examining the contents of these files in a somewhat more human-readable way. You'll find it in the subdirectory for llvm's build artifacts, not swift's, and you can run it like this:
$ llvm-bcanalyzer -dump MyModule.swiftmodule
Thanks so much for your explanations. I actually wasn't even considering the cross-module part, but I'm glad you brought it up. I'll be sure to check out those files and tools.
Do you know anything about how it works within a single module? For example, if I'm building Module M that includes A.swift and B.swift, what allows B.swift to use types from A, and vice-versa? My understanding of a language like C is that the preprocessor simply copies declarations in headers into the importing translation unit and the compiler type-checks one translation unit at a time, which wouldn't work for Swift.
The documentation page for the driver goes into this in some detail: https://github.com/apple/swift/blob/master/docs/Driver.md
If you compile a module with -###, you'll see the invocation details for the lower-level frontend jobs that get executed. Note that when you compile a module, you must pass all the source files in that module to the driver. When compiling without whole-module optimization, you'll see the driver spawn one frontend invocation for each source file. Each invocation lists all the source files in the module, but each has a different "-primary-file". The primary file is the one for which code is actually generated; the compiler does lazy typechecking using the other files to make sure that your in-module references are correct. (So it must be doing some amount of parsing for them, which would be repeated work; I don't know enough about the details of this to gauge the overhead.)
Each of these frontend jobs produces a "partial" .swiftmodule file, and then once they're all complete, a final "merge modules" job runs that takes them all and combines them into a single .swiftmodule for the whole thing.
If you compile with whole-module optimization enabled, the behavior is a bit different—the driver only invokes a single frontend job and typechecks and compiles all the source files at once. Incidentally, this is why folks often find better compile-time performance with WMO enabled, but the new "batch mode" is supposed to address this for non-WMO builds.
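As a rough illustration of the in-module case, here are two hypothetical files, shown in one listing so it stays self-contained. The file names and types are made up for the example. Each file refers to a type declared in the other, with no import or forward declaration.

```swift
// --- Hypothetical A.swift ---
struct Author {
    let name: String

    // Refers to `Book`, which is declared in B.swift.
    func write(title: String) -> Book {
        return Book(title: title, author: self)
    }
}

// --- Hypothetical B.swift ---
struct Book {
    let title: String
    // Refers back to `Author`, which is declared in A.swift.
    let author: Author

    var byline: String {
        return "\(title) by \(author.name)"
    }
}
```

When A.swift is the primary file, the frontend job still receives B.swift, which is how it can resolve Book and its members while generating code only for A.swift.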
The language is quite deliberately designed to minimize the amount of work needed to understand the declarations in a file. For instance, this is part of the reason why you have to explicitly state function parameter and return types—they don’t want to need to type-check the bodies of functions in other files.
My understanding is that the compiler looks only at the function and type declarations, not the implementations, in the non-primary files. That gives it enough information to type-check. A lot of the language syntax deliberately supports this trade-off.
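A small sketch of that trade-off, using a hypothetical helper: the signature carries everything a caller needs, so a use site can be checked without looking inside the body.

```swift
// Hypothetical helper that might live in a non-primary file.
// The signature alone tells callers that they pass [Double] and get a Double.
func average(_ values: [Double]) -> Double {
    // Nothing in this body is needed to type-check a call site.
    guard !values.isEmpty else { return 0 }
    return values.reduce(0, +) / Double(values.count)
}

// Hypothetical use in the primary file: `mean` is inferred as Double
// from the declared return type, not from the body above.
let mean = average([1, 2, 3, 4])
```

Local variables like mean can still rely on inference, because their types only matter within the file being compiled.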
Absolutely. The compiler can take a lot of shortcuts when it's parsing the non-primary files. I mainly wanted to call out that it's a non-zero amount of work for it to still do lexical analysis and basic parsing of those files. But in the end, this is conceptually similar to the work a C compiler has to do when it compiles a set of .c files that include the same headers, I imagine.
I appreciate everyone’s responses on this, and I look forward to digging in more on my own!