Changing comment trivia attribution from trailing trivia to leading trivia

ahoppen · July 30, 2021, 3:09pm

The current rule how trivia are split between leading and trailing trivia, is that all trivia are trailing up to the first comment or newline. Thus, comments are always leading trivia of the next token.

This, however, is causing unintuitive behavior in some cases. For example in

func foo() { // comment
    bar()
}

// comment is a leading trivia to bar. I would argue that it would be more natural to view it as trailing trivia to {.

Thus, I’m proposing to change the trivia attribution rule to make all trivia trailing up to the next newline only, which would make // comment be a trailing trivia to {.

This would also change the way inline trivia are distributed to their tokens, for example

         Old comment-token matching
┏━━━━━━━━━━┓┏━━━━━━━━━━┓┏━━━━━━━━━━┓┏━━━━━━━┓
/*X_START*/x/*X_END*/ + /*Y_START*/y/*Y_END*/
┗━━━━━━━━━━━━━━━━━━━━┛┗━━━━━━━━━━━┛┗━━━━━━━━┛
         New comment-token matching

IMHO neither attribution is more intuitive than the other, so I don’t see it as an argument against the change.

Also note that doc comments are still attributed as leading trivia to the next token if they start on a newline (which they should do in basically all cases).

Would people be comfortable with this change or are there any advantages of the current attribution rule that I have overlooked?

blangmuir · July 30, 2021, 4:35pm

It's interesting that neither of them produces what to a human would be obvious:

┏━━━━━━━━━━━━━━━━━━━┓   ┏━━━━━━━━━━━━━━━━━━━┓
/*X_START*/x/*X_END*/ + /*Y_START*/y/*Y_END*/

We could imagine handling that using spacing rules, but of course that wouldn't help if the spacing was uniform, so it's not clear that it has much value to handle the special case.

Can you clarify, does the trivia attribution affect doc comment attaching to a decl when it's not on a new line, or is doc comment attachment independent?

var foo: Int /// Comment
var bar: Int

Currently this doc comment would be attached to bar, although obviously it fits better with foo like in your proposal. Incidentally, for clang/doxygen there is a specific syntax for trailing comments with an extra '<'.

int foo; ///< Comment

ahoppen · July 30, 2021, 5:11pm

We could imagine handling that using spacing rules, but of course that wouldn't help if the spacing was uniform, so it's not clear that it has much value to handle the special case.

I agree. But I think everything that even gets close to some sort of that “smart” behavior is too complex to reasonably fit into the lexer. And I assume it also depends on the application, so I think it’s reasonable to require some cleanup logic on that side. It’s easy enough to get the next token’s leading trivia as token.nextToken.leadingTrivia.

Can you clarify, does the trivia attribution affect doc comment attaching to a decl when it's not on a new line, or is doc comment attachment independent?

Doc comments are treated like normal comments, so under the new rules, the comment would be attached to the first Int (after foo).

I was trying to say that

func foo() {}
/// doc comment for bar
func bar() {}

will not be affected since the comment starts on a new line.

SDGGiesbrecht · July 30, 2021, 10:06pm

Would a helper method to retrieve (and maybe modify) the combined Trivia composed of that which trails the preceding token and leads the current token be more useful?

While the parser is consistent, the factories allow just about anything, so I learned a while ago not to trust other code to have put the trivia in the “right” place. I now tend to think of trivia as neither “leading” nor “trailing”, but “between” and habitually inspect both just in case.

allevato · July 31, 2021, 4:02am

Just to throw out a contrived example that the current scheme perhaps handles better than the proposed one:

let x: Int  // this is a long comment
            // that is continued across
            // multiple lines
someOtherCode()

Under the current scheme, that comment falls entirely within a single trivia sequence. Under the new scheme, the first line would be detached from the rest of the comment. And I definitely wouldn't advocate for any advanced logic in the lexer that would try to guess based on indentation whether those subsequent comment lines were continuations or not.

Now, I don't think that comment style is very common and we shouldn't design around it per se, but the current approach does have a couple of nice consistency properties:

Trailing trivia only ever consists of spaces and tabs (and, rarely, garbage text).
The text of a single "conceptual" comment is never broken across two different trivia sequences.

I don't know if this necessarily makes it worth keeping, but I don't know whether there are major advantages to the proposed approach either. I think the ideal trivia representation might depend too much on the specific coding conventions of the source being parsed, and so clients that want to ensure that they don't drop anything (like swift-format) are going to look at the combined trivia between neighboring tokens, like @SDGGiesbrecht points out.

ahoppen · September 1, 2021, 2:18pm

Sorry for the delay and thanks for the feedback. I have been thinking about this a little more and believe that the proposed default splitting scheme is more natural in the majority of cases (as @blangmuir also pointed out). But I think it also became clear (by @SDGGiesbrecht’s and @allevato’s comments) that having a way to split the trivia in a more sophisticated way would also be beneficial.

Proposal

I would propose that we

Change the default trivia lexing behavior to the one I described in the original post
Add the following convenience methods to Syntax (names not finalized yet). Users of SwiftSyntax could then add custom trivia-splitting schemes (e.g. one that considers indentation which would handle @allevato’s case correctly). If such an advanced splitting scheme turns out to be generally useful, we can consider adding it to SwiftSyntax itself.

/// Returns all trivia occuring directly in front of this syntax node. This does not split trivia into leading and trailing; this node’s leading trivia is equivalent to the previous token’s trailing trivia.
var allTriviaBefore: Trivia? {
	return previousToken.trailingTrivia + leadingTrivia
}

var allTriviaAfter: Trivia? {
	return trailingTrivia + nextToken.leadingTrivia
}

Add legacyLeadingTrivia and legacyTrailingTrivia (again, names not finalized yet) that provide the current trivia behavior by using allTriviaBefore and allTriviaAfter and applying the current splitting rules. It should help migrating to the new behavior.

IMHO this would give us a simple and slightly improved default behavior that’s performant to access. Access through allTriviaBefore and allTriviaAfter would be less performant (because we need to look up previousToken or nextToken resp.) but offers greater flexibility.

Alternative considered

To avoid API breakages, I considered naming legacyLeadingTrivia just leadingTrivia and implementing the new trivia behavior as triviaBefore and triviaAfter variables. I decided against it because I think that leadingTrivia and trailingTrivia are too good names to waste on transitional behavior. And I think breaking the API isn’t too much of an issue since SwiftSyntax doesn’t promise a stable API at this point anyway.

What do you think?

xwu · September 1, 2021, 6:06pm

Anyone not well-versed in the minutia of swift-format history isn’t going to remember or even know to think about which behavior is “legacy” and which is “new.”

More usable, I think, would be specifying “greedy” versus “lazy” for both leading and trailing trivia and letting the user choose.

allevato · September 1, 2021, 8:06pm

My concern would be less about stable vs. unstable API, but rather about silently changing the behavior of an existing API.

In the past, nearly all API changes to SwiftSyntax have been of the kind that once you update that dependency, client code fails to compile and users can work through the compiler errors to fix usage sites. With the change you're proposing, users won't have the compiler's help to identify locations that need updating; they'll just need to text-search the code base for occurrences of those properties and update them, and filter out any places where they may have used that terminology elsewhere (comments, local variables, helper properties'/methods' names in other types, etc.). That makes me a lot more nervous about getting everything right, and even thorough unit tests might miss something.

It would be great if we could have a transitional period instead of flipping the switch all in one commit; e.g., using @available(renamed:) so we can see usages flagged as warnings and update them before the behavior of the original property is going to be changed, even if that name is going to be reused in a later release.

ahoppen · September 20, 2021, 2:37pm

That is a good point. What do you think about calling the new accessor methods triviaBefore and triviaAfter, while leadingTrivia and trailingTrivia retain their current behavior and are marked deprecated?