Improving the AST dump format

chrisbia · March 26, 2025, 12:06am

I meant hooking in before the dump code in the compiler and working with the source code information in memory, from within the compiler, as opposed to having the compiler serialize that information and then deserializing it in a custom program; or perhaps serializing it from the dump command in a custom, more intuitive format.

jrose · March 26, 2025, 1:33am

Everyone else is already being helpful but I’m amused enough by being able to answer “does your position here stand” with “well, yes, but I’m no longer standing at it because I left Apple” that I’ll reply as well. My useful contribution is that it would be reasonable for the AST in the compiler to be the most reliable source of information about the source code, but in practice it is not that; it is the most reliable source of information the compiler needs from the source code. The implementation is used by both the compiler and SourceKit and so it’s going to be fairly complete, but it might not be stored in one place, or only computed lazily, or reused between phases of the compiler with different semantics, or with important details “missing” because they would increase memory usage or slow down compilation if the user hasn’t specifically asked for them, or even just not broken out because they’re not important for what the compiler needs. They’re not meant for consumption elsewhere.

(Years ago we got someone complaining that what we have is more of a concrete syntax tree than an “abstract” one, and, well, they’re probably right. That’s what happens when a fuzzy name sticks early on in development and an academic concept gets used as shorthand.)

I realize this doesn’t help you and probably won’t dissuade you from whatever you’re trying to accomplish. I’ll defer to everyone else for that.

Vinicius_Vendramini · March 26, 2025, 1:38am

For what it’s worth, I still stand by that decision too. Consuming the AST dump worked after I contributed several bug fixes to the Swift compiler, but even then it took a lot of effort to adjust to every new Swift version. It’s been years since I’ve used any of it, but I’d probably go with SwiftSyntax for the main structure and SourceKit for the type information if I had to write it again.

mouser · March 26, 2025, 1:42am

I suppose that's what I do: consume an in-memory dump of the AST (from the compiler) and then pass the output into a language-agnostic tokenizer (I currently support ObjC and Swift but could add any number of languages, based on file extension). This tokenizer uses a generic structure and, while I don't care about most of the AST output (for now), I snatch up the source locations of the most important nodes, like classes, methods, functions, enums (and then some). I track ancestry for the graphic layout of the classes and for quick perusal in my object browser editor.

Otherwise, having this app rescan the source on every bit of user activity would kill a laptop battery in no time. :-)

Even though rescanning a source file is super fast, even on an M1 MBA.

Slava_Pestov · March 26, 2025, 2:44pm

You inspired me to remove the broken behavior of subst() with GenericFunctionType and fix the few remaining callers that rely on it, which is something I've wanted to do ever since I discovered that code path exists: "Clean up GenericFunctionType substitution" by slavapestov (Pull Request #80301, swiftlang/swift).

This also revealed the crash location which I think you were talking about with parameter packs, where mangleTypeAsUSR() calls getTypeForDWARFMangling(), which had an unnecessary use of subst(). This was totally my fault for introducing it in the first place. What happened is that your AST dumper is passing a GenericFunctionType to mangleTypeAsUSR(), which is something that hadn't been done before. This exercised the broken code path. You might want to revisit your examples and see if they're fixed once the above PR lands.

allevato · March 27, 2025, 1:13pm

I saw that you just merged it and tried the old example. It's working now, thanks! The interface type for g in the code below is being reported as $syyqd__c3lib4PackVyxxQp_QPGcD (lib.Pack<Pack{repeat A}>) -> (A1) -> (), which looks right to me.

struct Pack<each T> {
    func f(_ t: repeat each T) {
        repeat g(each t)
    }
    func g<U>(_ t: U) {}
}

On the opaque type issue, I'm still seeing the conformance not being included when I pass the opaque archetype directly into printTypeUSR (I had to remove an overly aggressive assertion that bans all archetypes in that function), but I'll keep plugging away at it; I haven't had time to really sit down and debug it yet.

Slava_Pestov · March 27, 2025, 1:19pm

Where is this assertion? Types that contain opaque archetypes answer false to hasArchetype(), which is the condition we usually check.

allevato · March 27, 2025, 1:30pm

In USRGeneration.cpp, I trap on the linked hasArchetype() assertion when trying to generate the USR for a type corresponding to an UnderlyingToOpaqueExpr node on the line marked below:

protocol P {}
struct S: P {}

func baz<T>(_ x: T) {
  func foo() -> some P {
    return S()  // <- here
  }
}

The type dumped right before the assertion:

(opaque_type address=0x130a0afc0 conforms_to="lib.(file).P@/Users/allevato/Scratch/jsonast/lib.swift:13:10" decl="lib.(file).baz(_:).foo()@/Users/allevato/Scratch/jsonast/lib.swift:17:8"
  (interface_type=generic_type_param_type depth=1 index=0 param_kind=type)
  (substitution_map generic_signature=<τ_0_0 where τ_0_0 : Copyable, τ_0_0 : Escapable>
    (substitution τ_0_0 -> 
      (primary_archetype_type address=0x130a07580 conforms_to="Swift.(file).Copyable" conforms_to="Swift.(file).Escapable" name="T"
        (interface_type=generic_type_param_type depth=0 index=0 name="T" param_kind=type)))
    (conformance type="τ_0_0"
      (abstract_conformance protocol="Copyable"))
    (conformance type="τ_0_0"
      (abstract_conformance protocol="Escapable"))))

I guess this isn't directly because of the opaque type, but because the substitution map refers to the T from the parent context? EDIT: Yes, if I remove baz<T> and just leave foo, it doesn't assert. But I still need to figure out how to handle both cases.

Slava_Pestov · March 27, 2025, 2:50pm

Ah, this isn’t about opaque archetypes at all. That type contains the primary archetype T, and you will get the same crash if you try to mangle any type that contains a primary archetype, like Array<T>. You can use TypeBase::mapTypeOutOfContext() to replace primary archetypes with type parameters before mangling.

allevato · March 27, 2025, 3:12pm

Thanks, I can confirm that works (and is probably more sound than the ad hoc transformRec I'm currently doing to swap out archetypes with interface types).

But just to jump back to the original question about the opaque type, even now that I'm not converting anything needlessly to interface types, when I call printTypeUSR to mangle some P, that mangling doesn't appear to reference P anywhere—should it? What I end up with is:

"$s3lib3fooQryFQOyQo_D" -> "<<opaque return type of lib.foo() -> some>>.0"

The unpaired some at the end is what bothers me. I can track down P from the decl that owns it because it has an opaque_result_decl that lists the conformances, but it would be nice if I could use the type mangling as the sole source of information here. But if my expectations aren't correct, then we can build something that works around it, like a separate table that maps opaque type manglings back to the corresponding decl.

Slava_Pestov · March 27, 2025, 4:10pm

I think the "sole source of information" that you're after here is the generic signature of an opaque result declaration. (OpaqueTypeDecl::getOpaqueInterfaceGenericSignature()).

If I write

func foo<T>(_: T) -> (some Sequence<T>, some Any) {...}

Then the function declaration foo() also has an associated opaque result declaration which it points to, and vice versa. You can visit it while dumping foo() itself. Alternatively, every SourceFile has a table that maps those mangled names to opaque result declarations; you can dump this table instead. (This is how we resolve @_opaqueResultOf in a module interface, for example.)

The generic signature of foo() itself is <T>, but the generic signature of its associated opaque result declaration adds two new generic parameters:

<T, R0, R1 where R0: Sequence, R0.Element == T>

The actual return type of foo() is a tuple type that contains two opaque archetypes. Both archetypes refer to the same opaque result declaration. The first archetype's type parameter is R0, and the second archetype's type parameter is R1. Note that R1 is unconstrained in the generic signature.
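To make the example above concrete, here is a minimal sketch that fills in the elided body. Only the signature comes from the thread; the parameter name x, the returned values, and the caller are assumptions chosen so that the snippet compiles (Swift 5.7 or later is assumed for the structural opaque result type).

```swift
// Hypothetical body for the foo() discussed above; only the signature
// is taken from the thread.
func foo<T>(_ x: T) -> (some Sequence<T>, some Any) {
    // First element: underlying type [T], which callers see as R0
    // (R0: Sequence, R0.Element == T).
    // Second element: underlying type Int, which callers see as R1.
    // R1 carries no constraints.
    return ([x, x], 0)
}

let (elements, other) = foo("a")

// Callers can rely on the same-type requirement R0.Element == T,
// so the element type is known to be String here.
let collected: [String] = Array(elements)

// Nothing is statically known about the second opaque type beyond Any.
let boxed: Any = other
```

Both tuple elements are opaque archetypes that point back at the same opaque result declaration; they differ only in which of its added generic parameters they stand for.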

allevato · March 27, 2025, 4:39pm

Right—we dump that currently alongside the function that declares that opaque type:

		{
			"_kind": "func_decl",
			"usr": "s:3lib3fooyQr_QR_txlF",
			...
			"result": "$s3lib3fooyQr_QR_txlFQOyxQo__AaByQr_QR_txlFQOyxQo0_tD",
			"opaque_result_decl": {
				"_kind": "opaque_type",
				...
				"declared_interface_type": "$s3lib3fooyQr_QR_txlFQOyxQo__AaByQr_QR_txlFQOyxQo0_tD",
				"generic_params": [
					"$sxD",
					"$sqd__D",
					"$sqd_0_D"
				],
				"reqs": [
					{
						"_kind": "requirement",
						"first_type": "$sxD",
						"kind": "same_type",
						"second_type": "$s7ElementSTQyd__D"
					},
					{
						"_kind": "requirement",
						"first_type": "$sqd__D",
						"kind": "conforms_to",
						"second_type": "$sST_pD"
					},
					{
						"_kind": "requirement",
						"first_type": "$sqd_0_D",
						"kind": "conforms_to",
						"second_type": "$sypD"
					},
					{
						"_kind": "requirement",
						"first_type": "$sqd_0_D",
						"kind": "conforms_to",
						"second_type": "$sypD"
					}
				]
			},

So I think I have everything I need, as long as I correctly map the depths/indices back to the right generic parameters. (The conforms_to requirements here have the issue of collapsing invertible protocols, which I still need to fix.)

The use case I'm thinking of is that we'll have logic in our analyzer that asks the question "what type is <arbitrary expr>?", to which the answer might be that opaque type mangling. So if I'm passing that around and then need to reason about the specific conformances later, I'll need to make sure I have a handle back to the original opaque type decl, since I can't extract it from the type itself. Thanks for the pointer to the mapping in the SourceFile—I didn't know about that, and it might be easier to work with than trying to remember arbitrarily nested decls.

Slava_Pestov · March 27, 2025, 4:51pm

There’s a getRequirementsWithInverses() that undoes the transformation, but I don’t think you want to do that. A dump should faithfully represent the actual state of affairs.

I’m wondering if it might be better for your use case to mangle opaque types as erased existentials, e.g. some P just becomes any P, and so on.

The question “does a type parameter T conform to a protocol P in a generic signature G” is extremely difficult to answer for arbitrary T and G unless you’re the compiler. Looking at the explicit requirements is insufficient when T is a dependent member type, because of protocol inheritance, same type requirements, etc.

Specifically the case I’m wondering about is something like

func f<T>(…) -> some Sequence<T> {
  …
}

let s = f(…)
let iter = s.makeIterator()

Now the type of iter is an opaque archetype with the same generic signature as s, but whose type parameter is a dependent member type derived from the type of s. In fact iter conforms to IteratorProtocol, but this isn’t explicitly stated in the signature of the opaque result declaration.

An existential type loses some of the generality but it’s easier to reason about.
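A runnable version of the iter case, as a sketch: the body of f, the argument 7, and the helper drain are assumptions added for illustration and are not from the compiler or the dump format. The point is that the call to drain type-checks even though IteratorProtocol is never written in the signature of f, because the compiler derives that conformance from Sequence itself.

```swift
// Hypothetical body; the thread only gives the shape of the signature.
func f<T>(_ x: T) -> some Sequence<T> {
    return CollectionOfOne(x)
}

// Hypothetical helper that demands an IteratorProtocol conformance.
func drain<I: IteratorProtocol>(_ iterator: inout I) -> [I.Element] {
    var out: [I.Element] = []
    while let element = iterator.next() {
        out.append(element)
    }
    return out
}

let s = f(7)

// The type of iter is a dependent member type (Iterator) of the opaque
// type of s. Its conformance is implied, not listed explicitly.
var iter = s.makeIterator()
let drained = drain(&iter)

// The erased form is easier to reason about but forgets the element type.
let erased: any Sequence = s
_ = erased
```

A tool that only reads the explicit requirements of the opaque result declaration would see Sequence and the Element same-type requirement, but nothing that mentions IteratorProtocol.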

allevato · March 27, 2025, 5:08pm

Yeah, we're currently combining multiple sources of information like indexstore to help us navigate protocol inheritance hierarchies by computing the chains ourselves when we need to, but that can run into issues with, say, a conditional conformance via extension, where we would need to start evaluating where clauses to decide if we should walk a certain edge or not. I'm not too keen on implementing a parallel type checker here.
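As a small illustration of the conditional conformance problem, the sketch below uses hypothetical names (Describable, Box) that are not from any real codebase. Whether the edge from Box to Describable may be walked depends on the where clause, which has to be evaluated separately for each concrete generic argument.

```swift
protocol Describable {
    func describe() -> String
}

extension Int: Describable {
    func describe() -> String { "Int(\(self))" }
}

struct Box<T> {
    var value: T
}

// Conditional conformance: Box<T> is Describable only when T is.
extension Box: Describable where T: Describable {
    func describe() -> String { "Box(\(value.describe()))" }
}

// Erase to Any so the check happens against the concrete type.
let good: Any = Box(value: 1)       // Box<Int>: the where clause holds
let bad: Any = Box(value: "text")   // Box<String>: the where clause fails

let goodConforms = good is any Describable
let badConforms = bad is any Describable
```

An index that only records "Box conforms to Describable via an extension" cannot tell these two cases apart without evaluating T: Describable.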

This sounds like an interesting idea; I'll see how it shakes out. Maybe for any opaque type, we can show both the opaque type mangling and the equivalent existential mangling in the AST dump, and then we can decide which one (or both) we want to use at a later point.

Slava_Pestov · March 27, 2025, 5:15pm

Inheritance among protocols is straightforward (nothing is conditional), but yeah, once you start looking at concrete types, you're back to needing a generics implementation if you find yourself needing to evaluate conditional requirements. (However, you could perhaps generate a table of all concrete types that appear in the output, with the protocols each one conforms to.)

I hope that one day, we will have a nice reusable library to handle such questions.

You can generate a table whose keys are every opaque return type that actually occurs in the output, and whose values are the erased existential types of each one (there's an ArchetypeType::getExistentialType() method that does this).
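On the consuming side, such a table could be modeled roughly as follows. This is only a sketch: OpaqueTypeEntry, its fields, and erasedForm(ofMangledType:) are hypothetical names, and the values stored for the key are illustrative rather than real dump output. Only the key string is taken from earlier in the thread.

```swift
// Hypothetical record describing one opaque return type found in a dump.
struct OpaqueTypeEntry {
    /// Printed form of the erased existential (illustrative value).
    let erasedExistential: String
    /// Hypothetical handle back to the declaration that owns the opaque type.
    let owningDecl: String
}

// Key: the opaque type mangling exactly as it appears in the dump.
var opaqueTypes: [String: OpaqueTypeEntry] = [:]

opaqueTypes["$s3lib3fooQryFQOyQo_D"] = OpaqueTypeEntry(
    erasedExistential: "any P",   // assumed, for illustration only
    owningDecl: "lib.foo()"       // assumed, for illustration only
)

/// Looks up the erased form without trying to parse the mangling itself.
func erasedForm(ofMangledType mangled: String) -> String? {
    return opaqueTypes[mangled]?.erasedExistential
}
```

Keeping the opaque mangling as the key means an analyzer can pass the mangled string around freely and still recover the conformances later with a single lookup.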

allevato · April 1, 2025, 4:41pm

I'm working on wrapping this up, and it looks like mapTypeOutOfContext() alone isn't sufficient to get a type that ASTMangler can handle. After calling it, in some cases I still have ElementArchetypeTypes or ExistentialArchetypeTypes (or something like a FunctionType that contains those).

The implementation of mapTypeOutOfContext looks like it only handles primary and pack archetypes, so that seems to match what I'm seeing; is that expected?
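For anyone following along, here is a sketch of source constructs that can put these local archetypes into expression types. The names Shape, Square, totalArea, and describeAll are hypothetical, and the mapping to ExistentialArchetypeType and ElementArchetypeType in the comments is an assumption about how the type checker represents these expressions, not something confirmed in this thread. Swift 5.9 or later is assumed for the parameter pack.

```swift
protocol Shape {
    func area() -> Double
}

struct Square: Shape {
    var side: Double
    func area() -> Double { side * side }
}

func totalArea(_ shapes: [any Shape]) -> Double {
    var total = 0.0
    for shape in shapes {
        // Calling a requirement on an `any Shape` value opens the
        // existential (assumed to surface as an existential archetype).
        total += shape.area()
    }
    return total
}

func describeAll<each T>(_ value: repeat each T) -> [String] {
    var out: [String] = []
    // Inside the expansion, `each value` has a per-element type
    // (assumed to surface as an element archetype).
    repeat out.append(String(describing: each value))
    return out
}
```

Neither kind of archetype belongs to the enclosing declaration's generic signature, which would be consistent with a transformation aimed at primary and pack archetypes leaving them in place.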

I also noticed this recent PR ("[Mangler] Handle local archetypes in `getDeclTypeForMangling`" by hamishknight, Pull Request #78855, swiftlang/swift), which touches on some of this, but I can't directly do something similar without updating ASTDumper to pass a DeclContext through to all the individual printers. I can make that change if it's necessary to get safe types for USR generation, though.

Slava_Pestov · April 1, 2025, 5:08pm

The simplest thing to do would be to skip expressions whose types contain local archetypes. If you just want the USR to be unique, you can walk the type to collect all generic environments of all local archetypes, and then use mapLocalArchetypesOutOfContext() to rewrite them into distinct type parameters. This loses information, but it’s probably fine. Otherwise, you can imagine a more general encoding where we record the relevant information about each local generic environment in a side table.

Slava_Pestov · April 1, 2025, 5:16pm

Another slightly hacky solution would be to replace local archetypes with their existential upper bounds before mangling. Again this loses information compared to encoding the generic signature of each local generic environment, but there isn’t much you could do with a raw list of requirements anyway without reimplementing the requirement machine.