Localization of Compiler Diagnostic Messages

HassanElDesouky · May 12, 2020, 5:33pm

Hello everyone, I hope you are doing well and staying safe :)

My name is Hassan ElDesouky, and I'm a third-year CS student at a university in Egypt. I have a background in iOS development and competitive programming. I also have a blog on Medium and a website, which you can check out at heldesouky.xyz. Finally, I'm an active member of the Swift community in Egypt, as I'm part of the SwiftCairo meetup.

I will be working on the Localization of Compiler Diagnostic Messages as part of GSoC 2020 with @xedin as my mentor.

This post is intended as a conversation starter so I can discuss the approach below with all of you. Any feedback is very welcome.

Why Localize the Compiler Diagnostic Messages?

Diagnostics play a vital role in the experience of using a programming language. For developer productivity, the compiler's messages must give proper guidance in every situation, especially when the code is incomplete or invalid. Currently, diagnostic messages are available only in English, which limits their usefulness to anybody who is not proficient enough in that language.

As a step towards the goal of making the Swift programming language more accessible to non-English speakers, this proposal extends the Swift compiler to produce localized diagnostic messages.

Goal

Currently, the Swift compiler's diagnostic definitions are hard-coded in .def files, e.g. DiagnosticsSema.def, which has several disadvantages:

The format is not extensible e.g. it's impossible to express localized messages.
The format is not accessible.
The format is hard for non-technical people to use.
Files are pre-processed into another format while the compiler itself is built. Therefore, you can't change the diagnostic file dynamically.

The goal of this project is to change that so diagnostic messages could be stored in a more accessible format like YAML to make it easy for non-technical people to contribute new translations.

Why `YAML`?

We chose YAML for two main reasons:

YAML is a human-friendly and more accessible format.
LLVM already has facilities to read that format out of the box.

Agenda and Deliverables

Main Goals

Deliverable
Open a discussion on the forums about the project.
Decide on the format of the .yaml file and its location in the repository and toolchain.
Port diagnostics from `.def` files into a new YAML file.
Refactor how diagnostic definitions are stored in `DiagnosticEngine` so it's easier to verify that the new diagnostic format has all of the diagnostics mentioned in the source.
Implement retrieval of individual diagnostics from the new diagnostic format.
Add a flag (e.g. `-locale`) which switches the compiler to always use the new diagnostic format.
Remove all of the old `Diagnostics*.def` files.

Stretch Goals

Deliverable
Serialize `YAML` into a binary format before loading it into the toolchain.
Figure out which format would be efficient for diagnostic retrieval.
Implement dual operation mode: directly from `YAML` or from its binary form.

Note: Implementing the dual operation mode will make it easy to modify diagnostics without needing to rebuild the compiler.

Following the schedule and deliverables, I've been researching the following areas.

Description of the Change

1. `YAML` format

Every diagnostic in the current Diagnostics*.def files is structured like this:

*KIND*(*diagnostic_identifier*,*diagnostic_options*,
      "*diagnostic_message*", (*diagnostic_signature*))

The suggested .yaml format is:

message_identifier: # diagnostic_identifier
    - kind: Error # The type/kind of the diagnostic (e.g. Error, Warning, Note)
    - options: none
    - signature: null
    - languages: # The different languages this message is translated to
        - en: "Error"
        - fr: "Erreur"
Consider a diagnostic like the following, whose message contains placeholders for the arguments listed in its signature. Today this is handled by DiagnosticEngine, which replaces each %<num> with the actual value when the diagnostic message is formed from the InFlightDiagnostic object. Therefore, we won't need to handle this in YAML (e.g. by using YAML anchors); DiagnosticEngine will keep handling it as it does right now.

ERROR(could_not_find_enum_case,none,
      "enum type %0 has no case %1; did you mean %2?",
      (Type, DeclNameRef, DeclName))

Proposed YAML file:

could_not_find_enum_case:
    - kind: Error
    - options: none
    - signature:
        - Type
        - DeclNameRef
        - DeclName
    - languages:
        - en: "enum type %0 has no case %1; did you mean %2?"
        - fr: "...."
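
To make the placeholder handling concrete, here is a minimal sketch of %<num> substitution. It is not the real DiagnosticEngine code: `formatDiagnostic` is a hypothetical helper, arguments are assumed to be already converted to strings, and only single-digit indices and `%%` are handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not the real DiagnosticEngine code): replaces each
// "%<digit>" in `format` with the matching entry of `args`. Only single-digit
// indices are handled, and "%%" is emitted as a literal percent sign.
std::string formatDiagnostic(const std::string &format,
                             const std::vector<std::string> &args) {
  std::string out;
  for (std::size_t i = 0; i < format.size(); ++i) {
    char c = format[i];
    if (c != '%' || i + 1 == format.size()) {
      out += c;
      continue;
    }
    char next = format[i + 1];
    if (next == '%') {
      out += '%';
      ++i;
    } else if (std::isdigit(static_cast<unsigned char>(next))) {
      std::size_t index = static_cast<std::size_t>(next - '0');
      assert(index < args.size() && "placeholder has no matching argument");
      out += args[index];
      ++i;
    } else {
      out += c; // Unknown sequence: keep the '%' as written.
    }
  }
  return out;
}
```

Because arguments are addressed by index rather than by position in the text, a translated message is free to mention %1 before %0 if the target language needs a different word order.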

1.1 New diagnostic format placement

First, in the repository, I will create a new subdirectory at include/swift/AST called Diagnostics and I'll put the Diagnostics.yaml file in it.

Second, I'll create a CMake file at include/swift/AST and it will handle copying the Diagnostics.yaml to its location in the toolchain.

swift_install_in_component(DIRECTORY Diagnostics
                           DESTINATION "share/"
                           COMPONENT compiler)

2. Approach to refactoring of the current format

Currently, the diagnostic messages are parsed from the Diagnostics*.def files into an array of strings (see DiagnosticEngine.cpp#L94-L101), which is then queried by position in DiagnosticEngine::diagnosticStringFor.
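
For readers unfamiliar with the technique, the sketch below shows how one list of macro entries can be expanded twice, once into IDs and once into a parallel array of strings. It is a simplified illustration, not the compiler's actual code: the `ALL_DIAGNOSTICS` macro stands in for the #include'd .def files, and `example_warning` is a made-up entry.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the contents of a Diagnostics*.def file. The second entry is
// made up for illustration.
#define ALL_DIAGNOSTICS(ERROR, WARNING) \
  ERROR(could_not_find_enum_case, none, \
        "enum type %0 has no case %1; did you mean %2?", \
        (Type, DeclNameRef, DeclName)) \
  WARNING(example_warning, none, "this is only an example", ())

// First expansion: one enumerator per diagnostic, in definition order.
enum class DiagID : unsigned {
#define AS_ID(ID, Options, Text, Signature) ID,
  ALL_DIAGNOSTICS(AS_ID, AS_ID)
#undef AS_ID
  NumDiags
};

// Second expansion: the message text, stored at the same position as its ID.
static const char *const diagnosticStrings[] = {
#define AS_TEXT(ID, Options, Text, Signature) Text,
    ALL_DIAGNOSTICS(AS_TEXT, AS_TEXT)
#undef AS_TEXT
};

// Lookup by position, in the spirit of DiagnosticEngine::diagnosticStringFor.
const char *diagnosticStringFor(DiagID id) {
  std::size_t index = static_cast<std::size_t>(id);
  assert(index < static_cast<std::size_t>(DiagID::NumDiags));
  return diagnosticStrings[index];
}
```

The key property is that the ID and its text always share an index, because both come from the same list.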

My first refactor for the DiagnosticEngine will be:

Parsing the diagnostic message information from the Diagnostics.yaml file into a std::vector of objects.
We can also iterate over the std::vector and fill diagnosticStrings[] with the diagnostic message strings.

Creating the std::vector<DiagnosticYAMLNode> takes O(n) space, where n is the number of diagnostics, and parsing takes O(n) time.

If we choose to keep working with the same diagnosticStrings[] instead of using the newly created std::vector<DiagnosticYAMLNode> directly, then creating this array will take an additional O(n) space and time.

Please note that this is not the best solution, as the time and space complexity can be improved.
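
As a rough sketch of the second step, assume the YAML has already been parsed into nodes. The `DiagnosticYAMLNode` fields and the `buildDiagnosticStrings` helper below are assumptions made for illustration; the actual YAML parsing is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one parsed YAML entry; field names are assumptions.
struct DiagnosticYAMLNode {
  unsigned id;         // Position of the diagnostic in the .def ordering.
  std::string message; // Message text for the selected language.
};

// Builds a table indexed by diagnostic position from parsed nodes, which may
// arrive in any order. Runs in O(n) time and uses O(n) extra space.
std::vector<std::string>
buildDiagnosticStrings(const std::vector<DiagnosticYAMLNode> &nodes,
                       std::size_t numDiags) {
  std::vector<std::string> strings(numDiags);
  for (const DiagnosticYAMLNode &node : nodes) {
    assert(node.id < numDiags && "node refers to an unknown diagnostic");
    strings[node.id] = node.message;
  }
  return strings;
}
```

Entries that have no node stay empty, which gives a later step a simple way to detect diagnostics missing from the YAML file.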

3. Retrieval of individual diagnostics methods.

Since I will be using the same diagnosticStrings[] or something similar for storing the diagnostic messages, I'll use the current method for retrieving diagnostic messages, which is querying by position.

4. New frontend flag

I'll create a frontend flag for using the newly created Diagnostics.yaml format, just like @owenv did with the -debug-diagnostic-names flag.

5. Efficient diagnostic storage

Performance is not a concern for diagnostic messages right now because all of the strings are included with the compiler as a static collection. Once the messages are read from YAML, however, a large project will involve multiple compiler invocations, and it might not be the best idea to re-parse the same YAML file over and over again.

One way to improve the time complexity is to split the Diagnostics.yaml file into multiple files, so that each file holds the diagnostics for a particular language. Then I'll serialize each file to an LLVM::BitStream file.

6. Efficient file format for the binary file

The time complexity for reading from the YAML file is O(n), and the time complexity for reading the LLVM::BitStream is also O(n). Therefore, we need a custom structure for the LLVM::BitStream file to improve the indexing time complexity.

We know that data inside LLVM::BitStream is stored in the form of blocks, which define regions of the file, and records, which contain data fields that can be up to 64 bits. Every block has a key field, which allows it to be identified uniquely.

We can reduce the time to access a diagnostic message by using a data structure based on multilevel indexing. Therefore, I'll use a B+ Tree. Using a B+ Tree as the file structure for the LLVM::BitStream file will improve the lookup for a diagnostic message from O(n) to O(log(n)).

You can also read more about the B+ Tree implementation in my GSoC proposal on Google Docs.

7. Implement dual operation mode

Still trying to figure it out.

GSoC Original Proposal

For more details on the solution and the performance cost, make sure to check out the GSoC proposal on Google Docs.

owenv · May 12, 2020, 6:25pm

This looks fantastic!

I have a few very minor comments:

I recommend installing the diagnostic files in share/swift/ instead of share/, because the swift toolchain may be installed at the root of the file system on some platforms.
Last time I counted we had about 2200 diagnostic messages, split across 7-10 .def files. When they are migrated to the new YAML format, it may be worth thinking about how we can best organize them so editors can scan through them easily.
As you mentioned in the google doc, right now the signature of DiagnosticEngine::diagnose looks like this:

template<typename ...ArgTypes>
    InFlightDiagnostic 
    diagnose(DeclNameLoc Loc, Diag<ArgTypes...> ID,
        typename detail::PassArgument<ArgTypes>::type... Args)

Currently, the values of Diag<ArgTypes...> are generated via macros. When we switch to the new system, will these be strings instead, or will they be generated some other way?

This may be out of the scope of the GSoC project, but at some point there should be a policy document for adding and editing translations so that the different languages don't get out of sync.

Chris_Lattner3 · May 13, 2020, 1:17am

Hi Hassan,

I'm thrilled you're working on this!

HassanElDesouky:

The suggested .yaml format is:

message_identifier: # diagnostic_identifier
    - kind: Error # The type/kind of the diagnostic (e.g. Error, Warning, Note)
    - options: none
    - signature: null
    - languages: # The different languages this message is translated to
        - en: "Error"
        - fr: "Erreur"

Out of curiosity, why are you working to duplicate the content in the .def file? The diagnostic identifier already provides a stable-ish identifier to latch onto. The idea of the diagnostics design is that you should be able to provide a localization by providing a catalog that maps between the diagnostic ID and a new string.

This means that your YAML file should just be something like this for an "imprecise english" translation catalog:

could_not_find_enum_case: "hey it doesn't look like %0 has a case named %1; dontcha think you should say %2 instead???"

This would mean that the diagnostics machinery would just have to open these catalogs, look up an identifier by its ID, and fall back to the standard english spelling if no hit in the catalog is available.
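
A minimal sketch of that lookup-with-fallback, assuming the catalog has already been loaded into memory. The `Catalog` alias and `localizedMessage` function are hypothetical names, and std::unordered_map is used only to keep the example self-contained.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical catalog type: diagnostic ID -> translated message.
using Catalog = std::unordered_map<std::string, std::string>;

// Returns the catalog's translation for `id` if there is one, and the
// standard English text otherwise.
std::string localizedMessage(const Catalog &catalog, const std::string &id,
                             const std::string &englishText) {
  assert(!id.empty() && "diagnostic IDs are never empty");
  auto it = catalog.find(id);
  return it != catalog.end() ? it->second : englishText;
}
```

With this shape, a partially translated catalog is still usable: every diagnostic without an entry is simply reported in English.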

I'm happy to defer to others on this, but I think a hash table would be much simpler and probably more efficient. I also think that bitcode is likely to be overkill for a simple "ID to string" map.

-Chris

HassanElDesouky · May 13, 2020, 1:58am

Thank you for taking the time to respond to this.

I totally agree, there's no problem.

Of course, if you also have any ideas in particular regarding this I'd love to hear them.

I discussed this with my mentor earlier when I was writing my GSoC proposal. So far, I think it will be validated using some global variables. Again, if you have any suggestions, they will help a lot!

I think you are right. I didn't think of this. Maybe I'll add this as a stretch goal for now.

HassanElDesouky · May 13, 2020, 2:17am

Hi Chris,

Thank you! This means a lot.

To be honest, I didn't want to make a dramatic change at the beginning, so I'm trying to implement this one step at a time. I was even thinking of using the existing diagnosticStrings[] array so that I make minimal changes at each step.

That being said, I didn't think that doing the following would be valid, to be honest. Therefore, I'll try to think about this more, and hopefully, we can implement something simple like what you have suggested.

Chris_Lattner3:

This means that your YAML file should just be something like this for an "imprecise english" translation catalog:
could_not_find_enum_case: "hey it doesn't look like %0 has a case named %1; dontcha think you should say %2 instead???"

@xedin and I discussed this while I was writing my GSoC proposal. I suggested using either a std::map<T, T>, which is of course built on top of a BST and not a hash map, or a std::unordered_map<,>, which is a hash map. However, the problem with this was that hash maps are not the best on-disk data formats because they require seeks to access the data, and collisions are possible. Therefore, we went with the B+ Tree data structure. I'd love to hear your thoughts on that.

Finally, thank you for taking the time to write such a detailed response. I appreciate it.

typesanitizer · May 13, 2020, 5:07am

I suggested using either a std::map<T, T>, which is of course built on top of a BST and not a hash map, or a std::unordered_map<,>, which is a hash map. However, the problem with this was that hash maps are not the best on-disk data formats because they require seeks to access the data, and collisions are possible. Therefore, we went with the B+ Tree data structure. I'd love to hear your thoughts on that.

Usually, we use common data structures like hash tables, sets etc. from LLVM instead of the standard library, since the LLVM data structures are faster for common operations. Many of the currency LLVM data structures are documented in the programmer's manual. This particular section might be helpful in deciding which container to use.

It might make sense to start out with using a pre-existing data structure and later switch to a more sophisticated custom data structure (in this case, a B+ tree) in case you have time left and the performance is not good enough in practice.

xedin · May 13, 2020, 6:51am

@HassanElDesouky To add to this point, I think we should keep the identifier and argument types in the .def file, so we can still build the collection of available ids with their respective arguments into the existing abstraction in DiagnosticEngine. That makes it easy to validate that diagnostic code uses only existing ids with correct arguments. The strings themselves are going to be stored either in a single YAML file or in one file per language (there are pros and cons to each approach, so we can decide as we go).

Diagnostic format could be simplified down to:

combined YAML:

- <diagnostic_id>: 
    en: "..."
    fr: "..."
    ...

or file per language e.g. en-US.yaml:

<diagnostic-id>: ...
...

The big advantage of splitting diagnostics into multiple files is that it allows loading only the information necessary to produce a diagnostic for the specified locale, whereas a single file means we'd have to load all of the translations.

HassanElDesouky · May 13, 2020, 2:34pm

That means we won’t delete the .def files, right?

xedin · May 13, 2020, 4:10pm

Yes, that's right but they'd still serve a very important purpose.

xedin · May 13, 2020, 4:16pm

Also, it seems like keeping them, at least for now, would ease the transition to the new format, because we can keep the .def files intact but use their messages only if -locale is not specified.

owenv · May 13, 2020, 4:27pm

I agree this will be nice for performance, but I think it will make maintenance harder. For example, if I want to change the wording of a diagnostic message, I would have to look through all the different translation files to find out if other languages also need to be updated.

I agree with keeping the .def files for now so that we don't need to change the diagnose API and the signatures are still type-safe. It should be easier to incrementally roll out the feature that way too. Once it's complete, I do think we should remove the strings from the .def files though so that they just describe the identifier, signature, and options of each diagnostic. Anybody who would want to change one of those three fields will need to update other parts of the compiler code too, so there's little value in being able to update them without rebuilding.

Chris_Lattner3 · May 13, 2020, 4:29pm

I think what you're looking for is a compact binary format that can be mmap'd in (e.g. using llvm::MemoryBuffer) and then directly indexed in an efficient way to do lookups. To do this, you ideally don't want a dynamically allocated in-memory representation that shadows the content of the file.

Given you have a string -> string map, I'd recommend having a fixed string hash function (e.g. use the existing one in llvm lib/Support), and write out the file as something like:

int32_t: hash table size, "N"
hash_bucket_t * N
random string data

The hash_bucket_t would be laid out like this:

struct hash_bucket_t {
  // Offset from start of the file to the full name like "could_not_find_enum_case" to check for collisions.
  int32_t full_key_name_offset;
  // Offset from start of the file to the value of the entry.
  int32_t value_offset;
};

This would make looking up an entry just a few loads from an mmapped file.
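
The layout above can be sketched end to end as follows. Several details are assumptions that the description leaves open: FNV-1a as the fixed hash, linear probing for collisions, a key offset of 0 as the "empty bucket" marker, NUL-terminated strings, and host byte order. A std::vector<char> stands in for the mmapped file, and `buildTable` and `lookup` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Fixed, platform-independent string hash (FNV-1a); an assumption here.
std::uint32_t hashKey(const std::string &key) {
  std::uint32_t h = 2166136261u;
  for (unsigned char c : key) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

std::int32_t readInt(const std::vector<char> &file, std::size_t offset) {
  std::int32_t value;
  std::memcpy(&value, file.data() + offset, sizeof(value));
  return value;
}

void writeInt(std::vector<char> &file, std::size_t offset, std::int32_t value) {
  std::memcpy(file.data() + offset, &value, sizeof(value));
}

// Serializes entries as: int32 N | N buckets of two int32 offsets | strings.
// A key offset of 0 marks an empty bucket, since offset 0 is the header.
std::vector<char>
buildTable(const std::vector<std::pair<std::string, std::string>> &entries) {
  const std::uint32_t n = static_cast<std::uint32_t>(entries.size() * 2 + 1);
  std::vector<char> file(4 + static_cast<std::size_t>(n) * 8, '\0');
  writeInt(file, 0, static_cast<std::int32_t>(n));
  for (const auto &entry : entries) {
    // Linear probing: take the first free bucket at or after the hash slot.
    std::uint32_t slot = hashKey(entry.first) % n;
    while (readInt(file, 4 + slot * 8) != 0)
      slot = (slot + 1) % n;
    const std::int32_t keyOffset = static_cast<std::int32_t>(file.size());
    file.insert(file.end(), entry.first.begin(), entry.first.end());
    file.push_back('\0');
    const std::int32_t valueOffset = static_cast<std::int32_t>(file.size());
    file.insert(file.end(), entry.second.begin(), entry.second.end());
    file.push_back('\0');
    writeInt(file, 4 + slot * 8, keyOffset);
    writeInt(file, 4 + slot * 8 + 4, valueOffset);
  }
  return file;
}

// Returns a pointer into `file` for the value of `key`, or nullptr if absent.
const char *lookup(const std::vector<char> &file, const std::string &key) {
  assert(file.size() >= 4 && "file is too small to hold the header");
  const std::uint32_t n = static_cast<std::uint32_t>(readInt(file, 0));
  std::uint32_t slot = hashKey(key) % n;
  for (std::uint32_t probes = 0; probes < n; ++probes) {
    const std::int32_t keyOffset = readInt(file, 4 + slot * 8);
    if (keyOffset == 0)
      return nullptr; // Empty bucket: the key was never inserted.
    if (key == file.data() + keyOffset)
      return file.data() + readInt(file, 4 + slot * 8 + 4);
    slot = (slot + 1) % n;
  }
  return nullptr;
}
```

The table is sized at roughly twice the number of entries so that probing always finds a free bucket. The lookup never allocates and returns a pointer straight into the buffer.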

-Chris

Chris_Lattner3 · May 13, 2020, 4:31pm

This will always be a problem, because you don't/can't in general read all the translations. Any significant change to a diagnostic means that the identifier should be changed and new translations have to be done.

I'd recommend checking out how projects like GCC and Emacs handle translation, because they have mechanisms for dealing with this.

-Chris

xedin · May 13, 2020, 4:40pm

Can you clarify what you mean by that? I think what @owenv is saying is that if we have a single file whose format lists all of the translations for a given identifier, adding new diagnostics or modifying existing ones shouldn't be problematic, because there is just one place to adjust. In a multi-file scheme, we'd potentially have to add new files to provide a translation, and diagnostic identifiers are not going to be listed in the same order everywhere.

owenv · May 13, 2020, 4:48pm

The point I was trying to make was that with this file:

some-diagnostic:
  en: "..."
  fr: "..."

It's easy to see that if I want to change this message and I only speak English, I need to create a new diagnostic ID, whereas if the translations are in separate files and I'm in the en file and all I see is this:

some-diagnostic -> "..."

I have to look in every other translation file to see if the text is safe to update without renaming the ID.

I think @Chris_Lattner3's suggestion to see how other projects handle this is a good one, some may have developed some kind of linter or automated tooling to enforce best practices in this area. Even with the combined format and no additional tooling, it would be possible to make mistakes, just less likely.

Edit: It could also just become best practice to rename an ID any time the text changes, even if the string isn't translated yet. That seems tricky to enforce, though.

akyrtzi · May 13, 2020, 4:57pm

llvm/Support/OnDiskHashTable.h seems appropriate.

xedin · May 13, 2020, 5:02pm

Tooling around this is definitely important because we'd want to verify automatically that all ids in .def are present in YAML and strings are formatted correctly.
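
One possible shape for such a check is sketched below. It assumes the English texts and one translation are available as ID-to-string maps, and it treats "formatted correctly" as "uses the same set of %<digit> placeholders", which is a simplification. The function names are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Collects the single-digit placeholder indices ("%0".."%9") used in `text`.
// Simplification: escaped sequences such as "%%0" are not treated specially.
std::set<int> placeholders(const std::string &text) {
  std::set<int> used;
  for (std::size_t i = 0; i + 1 < text.size(); ++i)
    if (text[i] == '%' && std::isdigit(static_cast<unsigned char>(text[i + 1])))
      used.insert(text[i + 1] - '0');
  return used;
}

// Hypothetical check: reports every ID whose translation is missing or uses
// a different set of placeholders than the English text.
std::vector<std::string>
findProblems(const std::map<std::string, std::string> &english,
             const std::map<std::string, std::string> &translated) {
  std::vector<std::string> problems;
  for (const auto &entry : english) {
    auto it = translated.find(entry.first);
    if (it == translated.end())
      problems.push_back(entry.first + ": missing translation");
    else if (placeholders(it->second) != placeholders(entry.second))
      problems.push_back(entry.first + ": placeholder mismatch");
  }
  return problems;
}
```

Comparing placeholder sets rather than their order lets a translation rearrange arguments while still catching a dropped or extra one.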

I think it might be reasonable to adopt a hybrid solution initially: a single YAML file with all translations, which is split into per-locale binary files (format to be determined) in the toolchain. That would satisfy both the discoverability and the performance requirements.

owenv · May 13, 2020, 5:32pm

I like that idea and I agree.

To summarize, I think there are two main suggested changes to @HassanElDesouky's original proposal, neither of which should significantly change the overall design:

Investigate a binary format that could be used with per-language files generated from the proposed YAML format. As Chris mentioned, this probably can be simpler than an LLVM bitstream.
Keep the .def files with the ID and signature. This simplifies the implementation, maintains the type-safe diagnose API, and will hopefully make incremental progress easier.

Hopefully that captures everyone's suggestions accurately!

xedin · May 13, 2020, 5:38pm

Thank you, @owenv! That sounds good to me!

Chris_Lattner3 · May 13, 2020, 6:01pm

My point is that there is nothing you can do about this - if you don't speak french and japanese and chinese and klingon, how are you going to update the translation?

The solution to this (from other communities) is that you change the translation ID - so the compiler produces the new/changed diagnostic in English - and then do a "call for translation" for the message that others can fill in.

Of course, if the person making the change happens to speak other languages, then they can update the translation dictionary for the languages they know, but in general these updates should be asynchronous.

-Chris
