SourceKit get 2D character indices (instead of byte indices)

taylorswift · July 23, 2018, 9:29pm

How can I get 2D (line, column) character indices from the SourceKit syntaxmap? The fields key.offset and key.length only give a linear byte range, which obviously causes problems when Unicode comes into play.

tkremenek · August 17, 2018, 1:05am

CC @akyrtzi @Nathan_Hawes @Xi_Ge @blangmuir

akyrtzi · August 17, 2018, 2:13am

The offset is a byte offset into a UTF-8 string. You need to take that into account when using string processing facilities if you want to convert it to a UTF-16 or UTF-8 character index.
I'm not sure what you mean by "obviously causes problems". The byte offset won't point at an arbitrary position inside the string; it's guaranteed to point at the beginning of a UTF-8 character.
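
As a rough illustration of that conversion on the editor side, here is a minimal sketch that maps a UTF-8 byte offset to a UTF-16 code unit index in a JavaScript string. The function names are hypothetical and not part of SourceKit; the sketch assumes the offset falls on a character boundary, as described above, and that the string holds the same text that was sent to SourceKit.

```typescript
// Number of UTF-8 bytes needed for one code point (hypothetical helper).
function utf8Length(codePoint: number): number {
  if (codePoint < 0x80) return 1;
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  return 4;
}

// Walk the string from the start, counting UTF-8 bytes, until the requested
// byte offset is reached. Returns the matching UTF-16 code unit index.
function utf8OffsetToUtf16Index(text: string, byteOffset: number): number {
  let bytes = 0;
  let index = 0;
  while (index < text.length && bytes < byteOffset) {
    const unit = text.charCodeAt(index);
    const next = index + 1 < text.length ? text.charCodeAt(index + 1) : 0;
    const isPair =
      unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    if (isPair) {
      // A surrogate pair is one code point above U+FFFF: 4 bytes in UTF-8.
      bytes += 4;
      index += 2;
    } else {
      bytes += utf8Length(unit);
      index += 1;
    }
  }
  return index;
}
```

For the text `let é = "🙂"`, byte offset 6 (just past the two-byte `é`) maps to index 5, and byte offset 14 (just past the four-byte emoji) maps to index 11. This is a linear scan per lookup; a real highlighter would probably convert all token offsets in a single pass over the buffer.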

taylorswift · August 17, 2018, 2:36am

The problem is in getting a mapping from the byte offsets to character offsets. Ideally, the highlighter shouldn’t care about the contents of the text buffer; it should just pass it opaquely to SourceKit and get character-indexed tokens in return. Otherwise we’d have to use ICU to find the character boundaries within the highlighter, and then search them to map the byte offsets. I’m already doing basic text buffer preprocessing to catch newlines so the 1D indices can be converted to 2D, but the lag time is about at the upper limit of what you would notice while typing (it’s currently only really usable for Swift files <1000 LOC, though that’s more Atom & JavaScript’s fault). Redoing grapheme breaking (which I assume SourceKit is already doing internally) would probably increase the lag to unacceptable levels. Could SourceKit expose the character indices directly?

Here’s some data on the latency: Using Github's Atom as a Swift IDE for Linux and Mac - #49 by taylorswift

JavaScript is mostly to blame, but we really don’t have many milliseconds to spare as a result.
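
For reference, the newline preprocessing mentioned above can be sketched as a table of line starts plus a binary search. This is not the actual highlighter code; the names are made up for illustration, and indices and columns are in whatever unit the table was built in (UTF-16 code units here).

```typescript
// Record the index at which every line begins. Line 0 always starts at 0.
function buildLineStarts(text: string): number[] {
  const starts: number[] = [0];
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) === 0x0a) {
      starts.push(i + 1); // the next line begins right after '\n'
    }
  }
  return starts;
}

// Binary search for the last line start that is <= index, then subtract
// it to get the column. Both line and column are 0-based.
function toLineColumn(
  starts: number[],
  index: number
): { line: number; column: number } {
  let lo = 0;
  let hi = starts.length - 1;
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (starts[mid] <= index) {
      lo = mid;
    } else {
      hi = mid - 1;
    }
  }
  return { line: lo, column: index - starts[lo] };
}
```

Building the table is one linear pass per buffer change, and each lookup afterwards is O(log n) in the number of lines, so the per-token cost stays small.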

akyrtzi · August 17, 2018, 2:47am

This is not how it works. The Swift compiler internally accepts a UTF-8 string for the source buffer and then only deals with byte offsets. For example, even the (line:column) pairs you see in the diagnostics that the compiler emits are actually (line:"UTF-8 byte offset from the start of the line") pairs. The notion of character indices doesn't exist internally.
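
To make the difference concrete, here is a small sketch going in the other direction: given one line of text and a UTF-16 index into it (what a JavaScript editor would have), it computes the UTF-8 byte offset from the start of the line. The function name is hypothetical, and how a tool then presents that offset as a column (for example 0-based or 1-based) is left out as an assumption.

```typescript
// UTF-8 byte offset, from the start of the line, of the given UTF-16 index.
function utf8ColumnOffset(line: string, utf16Index: number): number {
  let bytes = 0;
  let i = 0;
  while (i < utf16Index && i < line.length) {
    const unit = line.charCodeAt(i);
    const next = i + 1 < line.length ? line.charCodeAt(i + 1) : 0;
    if (unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      bytes += 4; // surrogate pair: one code point above U+FFFF
      i += 2;
    } else if (unit < 0x80) {
      bytes += 1; // ASCII
      i += 1;
    } else if (unit < 0x800) {
      bytes += 2;
      i += 1;
    } else {
      bytes += 3; // rest of the Basic Multilingual Plane
      i += 1;
    }
  }
  return bytes;
}
```

In the line `let é = x`, the `x` sits at UTF-16 index 8 but at byte offset 9, because `é` takes two bytes in UTF-8. The two numbers only agree while the line is pure ASCII.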

taylorswift · August 17, 2018, 2:49am

Hmm, that’s discouraging. I guess we have to do grapheme breaking in the highlighter then.
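
A minimal sketch of what grapheme breaking in the highlighter might look like, assuming a JavaScript runtime that provides `Intl.Segmenter` (an assumption about the environment, not something this thread confirms for Atom). The function name is hypothetical. It returns the UTF-16 index at which each grapheme cluster starts, so a UTF-16 index can then be turned into a character index with a binary search over the result.

```typescript
// UTF-16 start index of every grapheme cluster in the text.
function graphemeStarts(text: string): number[] {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const starts: number[] = [];
  for (const segment of segmenter.segment(text)) {
    starts.push(segment.index);
  }
  return starts;
}
```

For `"a" + "e\u0301" + "b"` (an `e` followed by a combining acute accent), this yields three clusters starting at indices 0, 1 and 3. Whether this is fast enough for the latency budget described earlier would have to be measured.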