Why whitespacelinter is so slow?

alaija · March 22, 2020, 8:22pm

Hello, I'd like to use swift-format, but faced that

swift-format -m lint <about_1000_lines_fille> takes about 25s

against

swiftlint lint <same_fille>, takes 0.25s

I run instruments and get that whitespacelinter takes all time:

Did someone investigate this problem?

allevato · March 22, 2020, 8:36pm

The call trees that you've left unexpanded in the screenshot of that Instruments run would provide more insight into which specific areas are slow.

25 seconds for a 1000 line file sounds excessive, though. When you ran swift build, did you include the -c release option? SwiftSyntax's syntax tree visitors have some complexity that receives significant speed-up when built with optimizations turned on.

Aside from that, there are certainly other areas in swift-format where we could invest more in performance improvements, but we've primarily been focused on stabilizing the core layout algorithm (making sure it's expressive enough for what we'll need/may need in the future) before moving on to those.

alaija · March 22, 2020, 9:17pm

Yes, I’ve used -c release option.

I’ll make investigation tomorrow, and provide all information.

alaija · March 23, 2020, 4:46am

xwu · March 23, 2020, 12:20pm

Do you have non-ASCII characters in your file by any chance?

The swift-format code in question operates on “characters” (which are extended grapheme literals), which is colossally expensive without the ASCII fast path. (nextCharacter does grapheme breaking on the entire remainder of the string in question.)

We’ve discussed before on this forum how source parsing should really use Unicode scalars for various reasons (for instance, combining characters in a string literal should not interfere with end delimiters). I suspect this is a demonstration of another reason why extended grapheme literals are not ideal for this use.

alaija · March 23, 2020, 12:31pm

I ran it over swift-format repo and got same result. (It wasn't finish oin 2 mins)

Can u check ur trace? I don't believe that it is something specific on my computer/files or something else

allevato · March 23, 2020, 2:21pm

Thanks for the additional info! Admittedly, linting has gotten less attention than formatting in our current priorities, but it looks like there's a lot of low-hanging fruit that we can fix up here.

Based on the number of accesses to count and calls to _opaqueCharacterStride, it looks like we're hitting bottlenecks even in the ASCII case—the character stride would be stored in the index, but since we reconstitute the whitespace runs that we're comparing as Strings instead of using Substrings, we can't use any of the original indices anymore and that information is thrown out and gets recomputed. So even if it's trivial to compute, it's an unnecessary hit.

But as you point out, there's no reason to use Character here at all, so switching this to use the UnicodeScalarView would probably yield significant savings on its own.

Lastly, the whitespace linter works today by running the pretty-printer on the source and then comparing whitespace runs in the original to the formatted output. So that's effectively making two passes over the source. Another improvement would be to have the pretty-printer retain the original whitespace trivia on the tokens and compare it to the whitespace that it would have printed during the layout process, but that's a deeper architectural change.

SDGGiesbrecht · March 23, 2020, 5:54pm

Far more alarming here is that nextCharacter(offset:data:) calls count on the entire string—twice. It does that inside a loop that runs for every non‐whitespace character in the string—twice per character.

Unicode does make a noticeable difference, probably because it slows that repeated iteration of the string. But it is negligible compared to the length of the file. I have several files I had to hide from swift-format because they ran longer than I had the patience to measure—I killed the process after twenty minutes. I never found the time to go back and look into it until now, but the first one I went back to find is essentially a single 1.5 MB base‐64 encoded string literal: i.e. no Unicode, little whitespace, extremely simple source tree, but very, very long.

A quick and dirty improvement that would work wonders would be to turn userText and formattedText into Array<Character> once in the initializer and let everything after that be random access.

A more thorough solution easier on memory would be to create a type that stores an index/offset pair and pass it around instead of lone offsets Then any string access can happen with the indices, but the arithmetic and reporting can still use the offset, and neither needs to be recomputed each time. (By string access, I mean to the main file strings, not the trivial, transient fragments assembled during the computation.)

While switching to scalars is probably a good idea for other reasons, I doubt it would change all that much, since they are no more random‐access than “characters”.

alaija · March 31, 2020, 7:48am

Thank you, @allevat, after fix it works much faster! swift-5.2-branch It's really amazing! Love it!