Improving the AST dump format

Vinicius_Vendramini · September 5, 2018, 2:42pm

Hi all!

I would very much like to (attempt to) improve the way the Swift compiler prints its AST.

Currently, the -dump-ast option prints the AST in a format that is similar to S-Expressions, which is supposed to be relatively easy for other programs to parse. However, I've been finding it hard to create a reliable parser for this AST format. It seems to me that the output was never meant to be parsed by other programs (instead, it's meant for human eyes only), hence the difficulty.

Would there be any interest (or opposition) from the community to implementing support for a different format such as YAML or JSON?
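
To make the parsing difficulty concrete, here is a minimal sketch of a reader for S-expression-like text, written in Python. The sample input is a hand-written approximation, not real -dump-ast output, and the function names `tokenize` and `parse` are purely illustrative. It only shows the kind of care needed when quoted values contain spaces and parentheses; the real dump has more irregularities than this handles.

```python
def tokenize(text):
    """Split S-expression-like text into '(' / ')' and atoms, keeping quoted runs intact."""
    tokens, i, n = [], 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch in "()":
            tokens.append(ch)
            i += 1
        else:
            start, quote = i, None
            while i < n:
                c = text[i]
                if quote:
                    # Inside quotes, only the matching quote character ends the run.
                    if c == quote:
                        quote = None
                elif c in "'\"":
                    quote = c
                elif c.isspace() or c in "()":
                    break
                i += 1
            tokens.append(text[start:i])
    return tokens


def parse(tokens):
    """Build nested lists from the tokens; assumes one well-formed top-level node."""
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ')'
        return items

    return node()


# Hand-written approximation of a dump fragment (an assumption, not compiler output).
SAMPLE = """(func_decl "foo(x:)" type='(Int) -> Int' access=internal
  (parameter "x"))"""

tree = parse(tokenize(SAMPLE))
```

Note that the parentheses inside the quoted name and type must not be treated as structure, which is exactly where a simple whitespace-and-parenthesis splitter goes wrong.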

jrose · September 5, 2018, 4:14pm

As I've mentioned before, I'm minorly against this, because I don't think parsing -dump-ast output is a good idea. We don't actually want to promise that this is a stable or sensible format, and we don't want people writing tools that depend on swift -frontend (which is itself an internal, unstable interface). But others may feel differently.

Aciid · September 5, 2018, 4:20pm

How about using SwiftSyntax for this?

allevato · September 5, 2018, 4:26pm

Anything using SwiftSyntax already implicitly depends on swift -frontend -emit-syntax so it feels like that boat may have already started leaving the harbor (unless we say that SwiftSyntax is one such "blessed" interface through which the frontend can be used).

But to the larger point, I agree—rather than try to change -dump-ast at this point (its concise ANSI-colored form is very nice for debugging and I wouldn't want to exchange that for a much more verbose data format), it would be nice to have a new mode analogous to -emit-syntax that contains semantic information instead of just raw syntax.

The output of SwiftSyntax isn't substitutable for -dump-ast because it only has syntax information; so for example, you can get the identifier for a type usage, but you can't easily determine which module it was imported from, whether it's a typealias for something else, etc. Similarly, IIRC -dump-ast contains synthesized declarations, whereas SwiftSyntax would not have those since it only presents what's actually in the source file.

jrose · September 5, 2018, 4:28pm

SwiftSyntax isn't stable yet either; when it is, it won't use -frontend. (*cough cough* @Xi_Ge, @akyrtzi)

akyrtzi · September 5, 2018, 4:46pm

The currently supported way to get semantic info from a source file is to use sourcekitd, via its 'index' or 'doc-info' request. Note though that those do not contain everything that an AST has, e.g. you don't get every expression and its type.

Vinicius_Vendramini · September 5, 2018, 5:16pm

Yeah, that's the thing for me. I need a lot of type data, and I can't think of a way to get it without the AST dump.
I remember looking at SourceKit a while ago, but I seem to remember it also didn't have everything.

Vinicius_Vendramini · September 5, 2018, 5:32pm

Alternatively, I wouldn't mind having more information from SourceKit.

akyrtzi · September 5, 2018, 8:18pm

The general goal with sourcekitd requests is to provide a reasonably stable interface, and not be tied to internal details of the Swift AST that can change at any point.
If you want output that describes as much of the AST's detail and structure as possible, and that essentially ties itself to internal details, then there's not much benefit in going with a sourcekitd request versus adding an option for the compiler to provide that output.

Vinicius_Vendramini · September 6, 2018, 10:43pm

Yeah, I've taken some time to look at what sourcekitd can do, and it seems to lack a lot of information that's only available at the AST level.

I understand that the AST is meant for internal use (as it should be) and isn't stable, so it will be on me to deal with changes made in future versions of the language. I also understand not wanting to provide a standardized output for fear it may mislead programmers into thinking it's a stable API of some sort.

However, as there is substantial source code information that's only available in the AST, I feel it can be beneficial to provide this standardized output nonetheless (perhaps with a warning of some sort). This would allow programmers that are willing to do this extra work to access this information in practice.

I also agree with @allevato that -dump-ast is useful for its own reasons, and that it would be bad to replace it completely. Perhaps the -dump-ast code could be refactored to separate the output format logic, and then an alternative output format could be added as well. This process would likely fix a handful of inconsistencies I've identified in the current -dump-ast code, and might also make it easier to maintain.

mouser · November 3, 2021, 5:51pm

There's a fundamental limitation in "swift -frontend -emit-syntax": the symbols it describes carry no line/column or global source offset for where they are defined. Unless I have missed an option that generates these.

At this point I have a Swift class capable of loading the entire JSON structure, plus surrounding code that outputs source-independent "tokens". These tokens are the same as those I generate from clang with "clang -cc1 -ast-dump".

I use these in a custom app that is a workspace/project source/class browser (remember ObjectMaster?) where I can browse and edit sources. While I can pull up ObjC source on a per-method basis, the same is not (yet) possible in Swift because -emit-syntax doesn't yield source location information.

I can still use the app to cross-reference all code usage to spot dead code (my current goal for this app), but it would be superb to be able to display functions as they are browsed, rather than resorting to the full source.

Vinicius_Vendramini · November 8, 2021, 1:33am

Hey @mouser, I'm not sure if this will help you but lately I've been using libSyntax to get the AST and SourceKit to get the type information (instead of the AST dump). It's a bit hard to match libSyntax's information with SourceKit's, but looking at your screenshot maybe libSyntax alone will solve your problem.

Here's my code that deals with libSyntax and SourceKit, I hope it helps: Gryphon/SwiftSyntaxDecoder.swift at release · vinivendra/Gryphon · GitHub

chrisbia · March 25, 2025, 9:19pm

Does your position here still stand, or have things changed significantly in this area since then? I'm wondering because the AST seems like a pretty integral part of the Swift language, so I have to imagine it's one of the most reliable (if not the most reliable) sources of structured information about the source code.

chrisbia · March 25, 2025, 9:19pm

Are you still working with ASTs and taking this approach? If so, I was wondering why you moved away from just using the dump.

Slava_Pestov · March 25, 2025, 9:25pm

I wouldn't recommend relying on -dump-ast output for anything serious. The trouble is three-fold:

1. The internal AST has some quirks, incidental complexity, and historical baggage. It does not always model language features in the way you'd expect.
2. The internal AST is not stable at all. It can change radically between compiler releases.
3. On top of that, the -dump-ast flag just prints a certain cross-section of the AST, in a loosely-defined format that was intended for visual inspection by compiler developers and not for parsing by tools. It's quite far from a true serialized representation of what's really there.

allevato · March 25, 2025, 9:31pm

You might be interested in the -dump-ast-format json flag that was added in Swift 6.1. It aims to provide the type-checked AST in a form that's more parsable than the default S-expression output.

However, there are caveats:

1. Everything Slava mentioned above is still true: the representation can be unusual, and there is intentionally no guarantee of stability between compiler versions, even for the parsable format.
2. Types and references are reported in a fairly opinionated format, using USRs. The main reason for this is to balance size and richness of the output. Decl reference USRs are the same as what is emitted into indexstore so you can cross-reference those with existing index store data, and type USRs can be demangled to get the full type you're interested in. But that means that using the data will involve leveraging the demangler.
3. There are still some known bugs/crashes involving things like invertible protocols and nested parameter packs.

My goal is to eliminate the bugs/crashes in #3 eventually, but it's not likely to happen during the Swift 6.2 timeline, either.
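
As a rough illustration of consuming that output, the Python sketch below walks an arbitrary JSON tree and collects strings that look like USRs. It deliberately assumes nothing about the schema: the key names in the sample are invented, the sample itself is hand-written rather than real compiler output, and the prefix check (`s:` for declaration USRs, `$s` for mangled type names) is an assumption about how those strings appear in the dump. The helper name `collect_usr_like` is hypothetical.

```python
import json


def collect_usr_like(node, found=None):
    """Recursively gather string values that look like Swift USRs or mangled names."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for value in node.values():
            collect_usr_like(value, found)
    elif isinstance(node, list):
        for value in node:
            collect_usr_like(value, found)
    elif isinstance(node, str) and node.startswith(("s:", "$s")):
        # Assumed prefixes: "s:" for declaration USRs, "$s" for mangled types.
        found.append(node)
    return found


# Hand-written stand-in for a JSON dump; the key names here are invented.
SAMPLE = json.loads("""
{"kind": "source_file",
 "members": [{"kind": "func_decl", "usr": "s:4main3fooyyF", "type": "$sSiD"}]}
""")

usrs = collect_usr_like(SAMPLE)
```

The collected strings would still need to go through a demangler (or be matched against index store data) before they are human-readable, which is the extra step caveat 2 describes. Because the format is not stable, a schema-agnostic walk like this is less likely to break between compiler versions than code that hard-codes key names.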

Slava_Pestov · March 25, 2025, 9:40pm

For the first problem, I don't think you want to dump the types in the inheritance clause as they are written. Instead, you want to dump the derived semantic information by calling various getters on the declaration:

- the superclass type for a class
- the local conformances for a nominal type or extension
- inherited protocols of a protocol
- the raw type for an enum decl

For opaque archetypes, you don't want to mangle those as their interface type because that loses information. The mangler accepts opaque archetypes.

For the last problem, you're calling subst() on a GenericFunctionType, which does some weird stuff that is usually not what you want. Why are you calling subst() on the type being mangled there?

allevato · March 25, 2025, 10:21pm

Only because I don't 100% know what the right thing to do is in some of these situations! Thanks for those pointers—I'll see if I can make some more progress soon.

chrisbia · March 25, 2025, 11:55pm

So if I wanted a true, serialized representation of what's there then I would have to work with the part of the Swift compiler where the AST dump is derived from?

mouser · March 25, 2025, 11:59pm

That's what I ended up doing (see my 3yo reply above).

I switched from the JSON output to the compiler's AST dump, which looks similar to the ObjC AST dump while being completely different. It's a nasty parser to write.
Still, the output is 10x smaller, so it's still more efficient.
See the result of what I use it for in XCodeMaster.