Grapheme cluster breaking in the compiler for diagnostic printing

Hi all,

I'm currently working on implementing a new diagnostics printing style which more closely associates errors and their notes (similar to what's described here). While I'm rewriting the printing code, I'd like to be able to support printing highlights and fix-its for source lines which include non-ASCII characters since the current implementation just drops them, which can be really confusing.

Unfortunately, to do that I think I probably need to do grapheme cluster breaking in the compiler to map a byte offset in the source line to a column number (I know grapheme clusters don't always map 1:1 to columns, but I think they're close enough). LLVM provides llvm::sys::locale::columnWidth, which Clang uses for its printed output, but it seems like it just hardcodes some known sequences and doesn't handle a lot of cases correctly.
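
To make the byte-to-column mapping concrete, here is a minimal sketch in Python (chosen only for brevity; the compiler itself would need this in C++). It is not UAX #29 grapheme breaking: it assumes that every code point which is not a combining mark starts a new column, and the name column_for_byte_offset is hypothetical.

```python
import unicodedata

def column_for_byte_offset(line: bytes, offset: int) -> int:
    """Return a 0-based column for a UTF-8 byte offset into `line`.

    Assumption: each code point with canonical combining class 0 starts
    a new column. This only approximates grapheme cluster breaking.
    """
    # Decode the bytes before the offset; a truncated sequence at the
    # end becomes U+FFFD instead of raising an error.
    prefix = line[:offset].decode("utf-8", errors="replace")
    return sum(1 for ch in prefix if unicodedata.combining(ch) == 0)

# "café" spelled with a decomposed é (e + U+0301 COMBINING ACUTE ACCENT).
source = "let cafe\u0301 = 1".encode("utf-8")
eq = source.index(b"=")
print(eq, column_for_byte_offset(source, eq))  # prints "11 9"
```

The "=" sits at byte 11 but in column 9, because the combining accent occupies two bytes and no column of its own.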

I guess my question is, does this seem like something even worth attempting? I took a look around the lexer and it doesn't seem like it has this functionality implemented, and I know linking ICU into the compiler is a non-starter.

I suppose this might be a good motivator to start integrating Swift code into the compiler, but I don't think we're quite ready for that yet :)

This might not be close enough, even after solving the grapheme cluster problem, given that it won't work for most East Asian characters and emoji. Given that emoji tend to be used in informal tutorial code, it would be nice to give good error messages there. Having a quick look around, it looks like most people solve this the way LLVM has, i.e. by generating code from the Unicode width tables. At this rate the compiler is going to end up with a lot of piecemeal reimplementations of ICU though, so I don't know if there's a structural solution here.
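
A rough sketch of the width-table approach, using Python's unicodedata module as a stand-in for generated tables. The function name display_width and the exact rules (wide and fullwidth characters count as 2, combining marks and format characters as 0, everything else as 1) are assumptions for illustration, not what LLVM actually does.

```python
import unicodedata

def display_width(text: str) -> int:
    """Estimate how many terminal columns `text` occupies."""
    width = 0
    for ch in text:
        # Nonspacing marks, enclosing marks and format characters
        # (such as ZERO WIDTH JOINER) are assumed to take no column.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East_Asian_Width W (wide) and F (fullwidth) take two columns.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width

print(display_width("abc"))        # 3
print(display_width("日本語"))      # 6
print(display_width("e\u0301"))    # 1
```

Control characters such as tabs are not handled here and would need their own rule.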

I don't think that's quite correct. For example, the emoji ":family_man_man_boy_boy:" is a single extended grapheme cluster even though it can be broken down into multiple, and it fits in a single column. The same holds true for most other emoji afaik.
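
For reference, the sequence under discussion can be taken apart like this. The sketch assumes the shortcode above stands for the usual zero-width-joiner sequence man, man, boy, boy; it says nothing about how wide a given terminal draws it.

```python
import unicodedata

# Assumed expansion of the shortcode: four emoji joined by U+200D.
family = "\U0001F468\u200D\U0001F468\u200D\U0001F466\u200D\U0001F466"

for ch in family:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# One extended grapheme cluster, but many code points and bytes.
print(len(family), "code points,", len(family.encode("utf-8")), "UTF-8 bytes")
# prints "7 code points, 25 UTF-8 bytes"
```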

I don't think this is a concern; I don't know of any ICU algorithms the compiler reimplements right now.

The deeper problem right now, I think, is that the only way that any part of Swift (specifically, the standard library) knows how to depend on ICU is through the ICU that's installed on the system. This already causes a handful of issues today, both on Apple OSes and Linux:

  • On any OS, the length of a string (in Characters) depends on the version of the OS you're running, since grapheme clustering can change from one version of Unicode to the next.

  • On Linux, the prebuilt toolchains posted on swift.org are linked to a specific version of ICU, and if you're running a slightly different distribution of Linux that doesn't have an exactly matching version of ICU, it's more painful to get things running because you have to deal with those library versioning issues (or build your own toolchain from source against whatever version you have).

A while back I wanted to look at hacking support for compile-time-verified named Unicode literals ("\U{LATIN SMALL LETTER A}") into the compiler, but since there was no existing ICU dependency, I shelved it.

There's been some discussion elsewhere on the forums about removing the ICU dependency and just embedding the actual data tables that the standard library needs into it to remove the version sensitivity. Once that happens, perhaps the compiler could use those as well? @Michael_Ilseman probably has more info on this.

I don't think that's such a big problem for diagnostics in the compiler, though. Worst case, if your source code uses some new Unicode symbols that your system ICU doesn't know how to parse, you'll get incorrect column information and misaligned arrows, which is no worse than it is now.

Since we claim full UTF-8 support for source code (types with emoji names and whatnot), the compiler obviously needs to understand those byte sequences to some degree for the text it outputs at compile time, especially for the ASCII-art diagrams which point to specific locations.

It's a problem if it would add a complex dependency that behaves differently on different systems, and one which the compiler and standard library team have apparently already said they'd like to remove elsewhere for the reasons mentioned above.

That's why I mentioned the work about pulling the data tables themselves into the standard library—it may be the case that it would be a lot more acceptable to link those into the compiler for diagnostic purposes over a hard dependency on a system library that is always a moving target.

This could be done with a call at run time, but for compile-time support we'd need the data tables available.

When the standard library has its own copy of the data tables and grapheme breaking algorithm, then (at least in theory) the compiler could use it at compilation time. This would also be useful for validating identifiers (right now the compiler makes a poor guess), as well as validating Character literals (right now the compiler is permissive).

Yup, lots of synergy comes with this.

Drop ICU dependency

I’m not an expert here, but it seems to me that the important thing is for the compiler’s output to match how the source code is displayed.

Obviously a programmer can use any text editor or IDE they want when viewing and editing source code, but in the vast majority of cases the program they use will itself rely on the Unicode implementation installed on their system.

Thus, the Swift compiler should use the system-provided Unicode implementation when calculating character positions, because those positions ought to match what the user sees in their source code editor of choice.

Thanks for the update on this! I think I’ll focus on just getting this working for ASCII in the short term then. It sounds like the best time to tackle better compile-time Unicode support will be once the standard library has successfully dropped the ICU dependency, since we can hopefully reuse some of that work.

Yup. There are a lot of things piling up behind that bottleneck:


What exactly is it you need “columns” for?

Is this about reporting numbers that other tools will use to find the locations of errors? (Then just use UTF‐8 byte counts for simplicity.)
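
If the numbers are only for other tools, the byte-count option is trivial. A small sketch, where the helper name utf8_column and the 1-based convention are assumptions chosen to match the "2:5" style in the example that follows:

```python
def utf8_column(line: str, char_index: int) -> int:
    """Return a 1-based column counted in UTF-8 bytes."""
    # Everything before the target, measured in encoded bytes.
    return len(line[:char_index].encode("utf-8")) + 1

print(utf8_column("var y = 0", 4))    # 5
print(utf8_column("let 日 = 0", 6))    # 9
```

No Unicode tables are involved, so the result cannot vary between systems or Unicode versions.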

Or is it about getting the caret in the right place on the next line for things like this?

hello.swift:2:5: error: invalid redeclaration of 'y'
  1| let y = 0
   |     ^ note: 'y' previously declared here 
  2| var y = 0
         ^

If that is what you are aiming for, then there really is no solution, because the whole concept of monospace crashes and burns when you venture beyond ASCII. I would advise completely rethinking how the information is presented instead, and find a way to clearly highlight the range inline. Maybe for terminals it could use colours or underlining. Where colours aren’t available, you could enclose it in ornamental brackets of some kind. Or you could borrow from the Venda language and stick a low combining circumflex directly onto the character (◌̭ U+032D). The point is, if you make it stand out inline, then you do not need column offsets at all.
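
A sketch of what inline highlighting could look like, covering the three options mentioned above. Everything here is hypothetical: the function name, the style names, and the choice of U+27E6 and U+27E7 as the ornamental brackets. Indices are code point offsets into the line.

```python
def highlight_inline(line: str, start: int, end: int, style: str = "brackets") -> str:
    """Mark line[start:end] inline so no separate caret line is needed."""
    before, target, after = line[:start], line[start:end], line[end:]
    if style == "ansi":
        # SGR 4 turns underline on, SGR 0 resets all attributes.
        return f"{before}\x1b[4m{target}\x1b[0m{after}"
    if style == "combining":
        # Attach U+032D COMBINING CIRCUMFLEX ACCENT BELOW to each code point.
        return before + "".join(ch + "\u032D" for ch in target) + after
    # Fallback for plain output: wrap the range in ornamental brackets.
    return f"{before}\u27E6{target}\u27E7{after}"

print(highlight_inline("var y = 0", 4, 5))
```

The combining style marks every code point in the range, so a range that already contains combining marks or joiner sequences would need cluster-aware handling.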

Nope, it is a single extended grapheme cluster but not a single column. Try it out in your closest terminal emulator and you'll find this not to be true (unless its Unicode support is very broken). This is why wcwidth exists in POSIX C libraries (though hopelessly out of date in most cases, as I understand it) and part of the reason why there are Unicode tables for how wide a character is. e.g. in macOS Terminal.app

This is technically true, but you can easily get a 99.9% solution just by following the Unicode tables, like basically every terminal emulator does.

Right to left scripts.

Okay, knock a 9 off if you like. Again, I admit it's technically true, and there are other exceptions (characters listed as ambiguous width, terminal emulators with poor or outdated Unicode support, etc). But you can still do a best-effort job on alignment, helping the vast majority of users, while simultaneously using colour or similar as a fallback.
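
One way to read "best effort with a fallback" is to check whether a line contains anything that makes caret alignment untrustworthy, and only draw the caret line when it does not. A sketch under that assumption; the function name and the two criteria (right-to-left characters and ambiguous-width characters) are illustrative, not exhaustive.

```python
import unicodedata

def alignment_is_reliable(line: str) -> bool:
    """Return False if caret alignment under `line` is likely to be off."""
    for ch in line:
        # Right-to-left letters (Hebrew, Arabic, ...) reorder on display.
        if unicodedata.bidirectional(ch) in ("R", "AL"):
            return False
        # East_Asian_Width "A" (ambiguous) renders as 1 or 2 columns
        # depending on the terminal's configuration.
        if unicodedata.east_asian_width(ch) == "A":
            return False
    return True

print(alignment_is_reliable("var y = 0"))         # True
print(alignment_is_reliable("let \u05E9 = 0"))    # False
```

When this returns False, a printer could skip the caret line and rely on colour, underlining, or brackets alone.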

This is not what those “width” properties are for.

Unicode Standard Annex 11: East Asian Width

When dealing with East Asian text, there is the concept of an inherent width of a character. This width takes on either of two values: narrow or wide. [...]

Layout and line breaking (to cite only two examples) in East Asian context show systematic variations depending on the value of this East_Asian_Width property. Wide characters behave like ideographs; they tend to allow line breaks after each character and remain upright in vertical text layout. Narrow characters are kept together in words or runs that are rotated sideways in vertical text layout.

Note: The East_Asian_Width property is not intended for use by modern terminal emulators without appropriate tailoring on a case-by-case basis. Such terminal emulators need a way to resolve the halfwidth/fullwidth dichotomy that is necessary for such environments, but the East_Asian_Width property does not provide an off-the-shelf solution for all situations. The growing repertoire of the Unicode Standard has long exceeded the bounds of East Asian legacy character encodings, and terminal emulations often need to be customized to support edge cases and for changes in typographical behavior over time.

I sincerely think it is better to aim for a solution that does not even need to attempt aligning separate lines. Inline highlighting works out of the box for 100% of Unicode.

It might not be intended for that, but that's what it is very commonly used for. And, again, “all situations” is not what was claimed, and alignment can be used in conjunction with inline colouring or underlining as I said. Right-to-left scripts generally render poorly in terminal emulators, and there are also people with various forms of colour-blindness, and probably terminal emulators where underlining doesn't work well, and definitely terminal emulators where the low-combining circumflex won't render correctly, or are monochrome, etc. But I didn't quote you and sarcastically write a bunch of ellipses between words in response.

I apologize for how I wrote it.
