Localization of Compiler Diagnostic Messages

Jumhyn · May 13, 2020, 6:06pm

What does that process gain over, say, just removing all of the old translations when a diagnostic message changes? Coming up with a new ID for a minor change in text seems like overkill, and will potentially pollute the ID namespace with my_diag_2 or my_diag_updated_5_13_2020.

xedin · May 13, 2020, 6:13pm

Yeah, it seems like changing a diagnostic's format would effectively invalidate its translations in all languages, which means we could just remove the translations in one place and re-translate. In practice this is not that common: argument types might be adjusted, e.g. from Identifier to DeclName, but mostly once a diagnostic is added it doesn't get changed that often.

Chris_Lattner3 · May 13, 2020, 8:52pm

Sure, that also works. I was just giving an example of a flow - leaving around dead translations serves no useful purpose. Having a decoupling like I described is mostly useful if the translations get updated asynchronously or separately. If they are all in a single monorepo, then there is no advantage I can see.

-Chris

phoneyDev · May 13, 2020, 11:17pm

Your translators will ask "One hump or two" if you start talking about yaml files. You need to use a simple and standard format for the translations themselves, like a simple strings file or XLIFF or something else. You then need a tool that generates that file format for the translator to use, and another tool that imports the translations back from the translator into either your yaml files or whatever your canonical file or files for the translations are.

Asking translators to edit a yaml file in a text editor is only asking for trouble. Translators have software that lets them edit XLIFF files without the possibility of messing up the xml. Even simple strings files will come back from the translators with formatting errors.

HassanElDesouky · May 13, 2020, 11:46pm

Firstly, I think this would be overkill. Secondly, I don't agree that "Asking translators to edit a yaml file in a text editor is only asking for trouble", because if we use a YAML schema like the one @Chris_Lattner3 and @xedin proposed, which is:

The translator will clearly know what they should do. What do you think?

I already thought of that. I came to the conclusion that formatting errors will most likely occur only in RTL languages, and supporting RTL languages is currently out of the project's scope.

HassanElDesouky · May 13, 2020, 11:53pm


I think, however, that using only one file that has all diagnostics, without splitting them up, would be a better user experience. If you are someone who knows English and French, for example, and want to translate diagnostic messages, I don't think it's convenient to open two files just to look up the original message in English and then translate it in the other file. What do you think?

xedin · May 13, 2020, 11:55pm

Sure! I think we can start there and decide as we go.

akyrtzi · May 13, 2020, 11:59pm

IMHO it is more straightforward and convenient to open 2 files in an editor and have them side-by-side than figure out how to have side-by-side sections from the same file.

phlebotinum · May 14, 2020, 12:14am

This is 100% true. In game development this is a no-brainer. You might say in a coder only software development environment you handle this more "hardcore" to avoid the tooling efforts. After all there are people who think an IDE is for losers ; )

phoneyDev · May 14, 2020, 12:20am

Not a chance. The translator is not a technical person. They don't know what is important in that file and they have no compiler giving them error messages when they make a mistake. They don't need to know about yaml file formats. That's your job.

My experience with strings files and translators is that when I get them back I always have to verify that the format is correct. Often enough a quote is missing or some other part of the formatting is wrong.

Why not go with strings files like this:

"english string 1" = "english string 1"      // To the translator A
"english string 2" = "english string 2"

"english string 1" = "translated string 1" // From the translator B
"english string 2" = "translated string 2"

You generate a file containing lines like A with all the english strings and send it to the translator. They translate all the strings and send it back. You import B into your yaml file. You need automated ways to export A and import B. (For strings that were previously translated file A contains those translated strings on the right side instead of the English strings.)

Oh, and notice the translator doesn't need to open two text editor windows.
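
The export/import round trip described above can be sketched roughly as follows. This is only an illustration, assuming the simple `"key" = "value"` line format shown in the post; the function names (`export_strings`, `import_strings`) are hypothetical and not part of any existing tool. The importer reports malformed lines (such as a missing quote) by line number instead of silently accepting them.

```python
import re

# One line of the form: "key" = "value"   with an optional trailing // comment.
LINE_RE = re.compile(
    r'^\s*"((?:[^"\\]|\\.)*)"\s*=\s*"((?:[^"\\]|\\.)*)"\s*(?://.*)?$'
)


def escape(text):
    """Escape backslashes and double quotes for use inside a quoted string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def unescape(text):
    """Reverse escape(): drop the backslash in front of an escaped character."""
    return re.sub(r"\\(.)", r"\1", text)


def export_strings(english, translated):
    """Build file A. Strings that were translated before keep their
    translation on the right side; the rest repeat the English text."""
    lines = []
    for text in english:
        value = translated.get(text, text)
        lines.append(f'"{escape(text)}" = "{escape(value)}"')
    return "\n".join(lines) + "\n"


def import_strings(contents, english):
    """Parse file B. Returns (translations, errors), where errors is a list
    of 1-based line numbers that are malformed or use an unknown key."""
    known = set(english)
    translations, errors = {}, []
    for number, line in enumerate(contents.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are allowed
        match = LINE_RE.match(line)
        if match is None or unescape(match.group(1)) not in known:
            errors.append(number)
            continue
        translations[unescape(match.group(1))] = unescape(match.group(2))
    return translations, errors
```

The reported line numbers are the kind of verification step that otherwise has to be done by hand when a file comes back from a translator.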

phlebotinum · May 14, 2020, 12:24am

If you get only developers to translate it can work, but still decent tooling improves the process, regardless who does it.

HassanElDesouky · May 14, 2020, 1:06am

I'm sorry, I don't think I fully understand you. How is it going to be "side-by-side sections from the same file" when you are editing in the same file as the following:

- <diagnostic_id>: 
    en: "..."
    fr: "..."
    ...

You will only be looking at the same diagnostic and won't need to open anything side by side... Is that right, or am I missing something?

akyrtzi · May 14, 2020, 1:13am

My bad, I misunderstood the format, though I'd consider it worthwhile to avoid loading all the translations unnecessarily (as @xedin mentioned).
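
To make the trade-off concrete, here is a small sketch of splitting a combined, per-diagnostic layout into one catalog per language, so that a consumer only has to load the language it needs. Plain dictionaries stand in for parsed YAML, and the diagnostic IDs and the `split_by_locale` helper are made up for illustration.

```python
def split_by_locale(combined):
    """Turn {diag_id: {locale: text}} into {locale: {diag_id: text}}."""
    per_locale = {}
    for diag_id, messages in combined.items():
        for locale, text in messages.items():
            per_locale.setdefault(locale, {})[diag_id] = text
    return per_locale


# Hypothetical combined layout, mirroring the single-file format above.
combined = {
    "example_diag_a": {"en": "message A", "fr": "message A (fr)"},
    "example_diag_b": {"en": "message B"},
}

catalogs = split_by_locale(combined)
# Only the French catalog needs to be read when the locale is "fr".
french_only = catalogs["fr"]
```

With this split, a translator could still work in the combined file while the compiler reads a single-language file generated from it.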

HassanElDesouky · May 18, 2020, 1:53pm

Week 1 Progress Update

Hi everyone, I enjoyed reading your comments about my approach regarding my project.

Here's what I did last week:

Opened a discussion on the forums.
Decided on a YAML file format.
Ported diagnostics from .def files to the YAML file.
Implemented a -locale flag to use diagnostic messages from the YAML file.

Currently doing the following:

Implementing a diagnostic-messages-path flag to get the directory of the diagnostic message folders, for development purposes.
Implementing a diagnostic message retrieval method that works for both .def and .yaml files.

Here's my approach to what I'm currently working on. I'd love to hear your opinions on it:

Implementing a `diagnostic-messages-path` flag

Currently, I have created a directory at include/swift/AST/Diagnostics then I created two files there (en.yaml & fr.yaml). My current method of retrieving the files is by writing down their full path on my hard disk, which of course, isn't the right way.

After discussing this with @owenv, we think the best approach is to add a frontend flag to the compiler that overrides the path to the YAML files directory, and to use it during development to prevent the compiler from looking relative to the main executable. This is similar to what the -diagnostic-documentation-path option does for educational notes.
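
The lookup rule described here (use the override if given, otherwise look relative to the executable) could be sketched as below. The directory layout `<prefix>/share/swift/diagnostics` and both function names are assumptions made for the example, not the compiler's actual layout or API.

```python
from pathlib import PurePosixPath


def localization_dir(executable, override=None):
    """Return the directory that holds the per-language YAML files."""
    if override is not None:
        # Development case: the flag value wins over any default location.
        return PurePosixPath(override)
    # Assumed install layout: <prefix>/bin/<compiler> next to <prefix>/share.
    prefix = PurePosixPath(executable).parent.parent
    return prefix / "share" / "swift" / "diagnostics"


def locale_file(executable, locale, override=None):
    """Return the path of the YAML file for one locale, e.g. fr.yaml."""
    return localization_dir(executable, override) / f"{locale}.yaml"
```

For example, `locale_file("/opt/toolchain/bin/swift-frontend", "fr", override="/src/diags")` points into the checkout rather than the installed toolchain.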

Implementing a diagnostic message retrieval method

The current way of retrieving diagnostic messages is by querying by position in diagnosticStrings[]. However, this won't work with the YAML file, because when adding a new diagnostic message you will need to add it to the right index in both the .def and the .yaml files.

My solution for this is to retrieve the diagnostic message not by position but with ID. So, maybe create a map that maps IDs to messages.
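
A tiny sketch of why positional lookup is fragile and keyed lookup is not. The IDs and messages are placeholders: the list stands in for a YAML file whose entries are not in the same order as the .def file.

```python
# Order in which the (hypothetical) diagnostics appear in the .def file.
DEF_ORDER = ["diag_a", "diag_b", "diag_c"]

# Entries as they appear in a YAML file; note diag_c comes before diag_b.
YAML_ENTRIES = [
    ("diag_a", "message A"),
    ("diag_c", "message C"),
    ("diag_b", "message B"),
]


def by_position(diag_id):
    """Positional lookup: only correct if both files share the same order."""
    return YAML_ENTRIES[DEF_ORDER.index(diag_id)][1]


def by_id(diag_id):
    """Keyed lookup: independent of the order of entries in the YAML file."""
    return dict(YAML_ENTRIES)[diag_id]
```

Here `by_position("diag_b")` returns the text for `diag_c`, while `by_id("diag_b")` returns the intended message.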

Finally, I'd love if you can give me any feedback on my approach and my progress so far!

xedin · May 18, 2020, 7:47pm

Great progress, @HassanElDesouky!

If you look at DiagnosticEngine.h, you'll see that DiagID is defined as enum : uint32_t. When diagnostic IDs are loaded from .def (in DiagnosticList.cpp), they are stored as cases in that enum, which is why it's easy to just build an array of them at the moment.

I think we can extend that scheme to YAML as well since, as we discussed, diagnostic IDs and signatures are going to be loaded/verified using the .def file(s). It seems like the abstract interface for diagnostic retrieval should be based on DiagID because it's easy to convert it to a number when needed...
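
As a rough model of this suggestion (written in Python rather than the compiler's C++, with placeholder IDs): the enum is generated from the .def order, and the ID-keyed YAML entries are placed into an array indexed by the enum's numeric value. `build_table` is a hypothetical helper name.

```python
from enum import IntEnum

# Stand-in for the IDs declared in the .def file, in declaration order.
DEF_IDS = ["diag_a", "diag_b", "diag_c"]

# Model of the C++ enum: each ID becomes a case with a numeric value.
DiagID = IntEnum("DiagID", DEF_IDS, start=0)


def build_table(translations):
    """Place ID-keyed messages into an array indexed by DiagID.
    IDs that are not declared in the .def list are ignored, and
    untranslated diagnostics are left as None."""
    table = [None] * len(DiagID)
    for name, text in translations.items():
        if name in DiagID.__members__:
            table[DiagID[name]] = text
    return table


# Entries may appear in any order in the YAML file.
table = build_table({"diag_c": "message C (fr)", "diag_a": "message A (fr)"})
```

Retrieval then stays an array index, e.g. `table[DiagID.diag_c]`, while the YAML file itself is free to list diagnostics in any order.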

HassanElDesouky · May 25, 2020, 12:20am

Week 2 Progress Update

Hi everyone, first of all, Eid Fitr Mubarak for you all!

This will be a very quick weekly update just to keep you posted. I successfully completed the two things I wanted to do last week, which were:

Implemented a diagnostic-messages-path flag to get the directory of the diagnostic message folder.
Implemented a diagnostic message retrieval method that works for both .def and .yaml files.

Because of Eid El-Fitr, I won't do much this week. So, if I have time I'll try to do the following:

Refactor DiagnosticEngine and implement a YAMLDiagnosticProvider field in DiagnosticEngine.

HassanElDesouky · June 2, 2020, 5:58am

Community period

First of all, I've had a really good time engaging with the community and I do love it! I think I made pretty good progress in the first month, and I'm excited to continue working on the project!

In brief, here's what I did last month:

I think I'm almost done with the main deliverables

Engaged more with the Swift community, e.g. posted weekly updates on the forums and communicated effectively with people in the community.
Became more familiar with the code base, especially how the DiagnosticEngine works!
Learned new OOP concepts in C++.
Implemented compiler frontend flags, e.g. -locale and -diagnostic-messages-path.
Implemented retrieval of diagnostic messages from the YAML file.
Handled diagnostic message fallbacks for when the requested language isn't supported.
Refactored DiagnosticEngine to support the new YAML diagnostics format.
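
The fallback item in the list above might look something like the following sketch. It assumes English is the fallback language and that each language is a dictionary keyed by diagnostic ID; the `message` function and the sample catalogs are illustrative only, not the merged implementation.

```python
def message(diag_id, locale, catalogs):
    """Return the localized text for diag_id, falling back to English when
    the locale is unsupported or the diagnostic has no translation yet."""
    english = catalogs["en"]
    localized = catalogs.get(locale, english)  # unsupported locale: English
    return localized.get(diag_id, english[diag_id])  # missing entry: English


# Hypothetical catalogs: French only translates one of the two diagnostics.
CATALOGS = {
    "en": {"diag_a": "message A", "diag_b": "message B"},
    "fr": {"diag_a": "message A (fr)"},
}
```

With these catalogs, a request for `diag_b` in `fr`, or for any diagnostic in an unsupported locale, yields the English text.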

Plans for this month:

Disclaimer: these are my own stretch goals, which I may not be able to finish this month (but I'll try).

Start working on PRs, and try to get the changes merged.
Handle tests, e.g. how to test and what to test.
Get familiar with LLVM data structure libraries.
Tackle the stretch goals.
Start implementing the YAML serialization.

HassanElDesouky · June 22, 2020, 4:40am

Hi everyone,

Here's an update on what I've been doing in the last two weeks:

I've been focusing on improving my C++ skills.
Worked mainly on PR #32239 for refactoring the DiagnosticEngine to use YAML files.
Discussed testing, lint and prune tools with my mentor.
Got a little more familiar with the LLVM data structure libraries and the LLVM code base in general.

Plans for the rest of the month:

Write tests for localization.
Open a PR for frontend flags and hopefully get it to be merged.

Finally, I'd like to thank my mentor. He's been a great help for me and he's always been very responsive and nice. So, thank you @xedin :)

HassanElDesouky · June 29, 2020, 4:16pm

Hi everyone, the first month of the GSoC coding period is almost finished so here's what I did.

Most of my time went into getting the first PR merged, which was PR #32483, in which I introduced localization support for diagnostics via a file-per-language store in YAML format.

Now I'm working on my second PR, #32568, which will introduce frontend flags for localization, as well as writing some tests to make sure we are getting something other than the normal English messages.

For the next month of the coding period, I think I'll work on creating lint and prune tools for localization, and maybe start tackling the stretch goals if I have time.

HassanElDesouky · July 28, 2020, 3:53pm

Hi everyone, this month (second coding period) I was mainly working on creating a serialized format for the yaml files.
https://github.com/apple/swift/pull/33022
In this PR I'm serializing YAML to an LLVM::OnDiskHashTable format. I also created a tool that handles serialization of the YAML file to the .db OnDiskHashTable format. I think this PR should be merged today or tomorrow.
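
To illustrate the general idea of an on-disk hash table (a lookup that reads only one bucket instead of parsing the whole file), here is a toy sketch. The byte layout is invented for this example and is not LLVM's actual format; `serialize` and `lookup` are hypothetical names.

```python
import struct
import zlib


def serialize(messages, num_buckets=8):
    """Pack {id: text} into bytes: a bucket count, a table of bucket
    offsets (0 means empty), then the bucket payloads."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in messages.items():
        k, v = key.encode("utf-8"), value.encode("utf-8")
        # crc32 is stable across runs, unlike Python's built-in hash().
        buckets[zlib.crc32(k) % num_buckets].append((k, v))
    header_size = 4 + 4 * num_buckets
    offsets, payload = [], bytearray()
    for bucket in buckets:
        if not bucket:
            offsets.append(0)
            continue
        offsets.append(header_size + len(payload))
        payload += struct.pack("<H", len(bucket))
        for k, v in bucket:
            payload += struct.pack("<HI", len(k), len(v)) + k + v
    return struct.pack(f"<I{num_buckets}I", num_buckets, *offsets) + bytes(payload)


def lookup(blob, key):
    """Find one message by ID, scanning only the bucket the key hashes to."""
    k = key.encode("utf-8")
    (num_buckets,) = struct.unpack_from("<I", blob, 0)
    slot = zlib.crc32(k) % num_buckets
    (pos,) = struct.unpack_from("<I", blob, 4 + 4 * slot)
    if pos == 0:
        return None
    (count,) = struct.unpack_from("<H", blob, pos)
    pos += 2
    for _ in range(count):
        key_len, val_len = struct.unpack_from("<HI", blob, pos)
        pos += 6
        if blob[pos:pos + key_len] == k:
            start = pos + key_len
            return blob[start:start + val_len].decode("utf-8")
        pos += key_len + val_len
    return None
```

The benefit over reading YAML at startup is that the compiler would not have to parse every message just to print one diagnostic.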

What's still remaining in the project is:

Remove the message text from the .def files and use only the new formats for retrieving diagnostics.
Create prune and lint tools for the YAML file.
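
As a rough idea of what prune and lint tools could check (the actual tools may work differently): pruning drops translations whose ID no longer exists in the English source, and linting flags translations whose argument placeholders differ from the English message. The sketch only recognizes the `%0` and `%select{...}0` forms with single-digit indices; the function names and sample data are illustrative.

```python
import re

# Matches %0 and %select{a|b}0 style placeholders, capturing the index.
PLACEHOLDER_RE = re.compile(r"%(?:[a-z]+\{[^{}]*\})?(\d)")


def argument_indices(text):
    """Return the set of argument indices referenced by a message."""
    return set(PLACEHOLDER_RE.findall(text))


def prune(english, translated):
    """Drop translations whose diagnostic ID no longer exists in English."""
    return {k: v for k, v in translated.items() if k in english}


def lint(english, translated):
    """Return IDs whose translation uses different arguments than English."""
    return sorted(
        diag_id
        for diag_id, text in translated.items()
        if diag_id in english
        and argument_indices(text) != argument_indices(english[diag_id])
    )


ENGLISH = {
    "diag_a": "expected %0 but found %1",
    "diag_b": "use %select{let|var}0 here",
}
FRENCH = {
    "diag_a": "attendu %0",                      # lost %1: lint error
    "diag_b": "utilisez %select{let|var}0 ici",  # fine
    "diag_old": "message obsolète",              # stale: pruned
}
```

Here `prune(ENGLISH, FRENCH)` removes `diag_old`, and `lint(ENGLISH, FRENCH)` reports `diag_a`.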