Hello everyone, I hope you are doing well and staying safe :)
My name is Hassan ElDesouky, and I'm a third-year CS student at a university in Egypt. I have a background in iOS development and competitive programming. I also have a blog on Medium and a website, which you can check out at heldesouky.xyz. Finally, I'm an active member of the Swift community in Egypt as part of the SwiftCairo meetup.
I will be working on the Localization of Compiler Diagnostic Messages as part of GSoC 2020 with @xedin as my mentor.
This post is intended to be a conversation starter so I can discuss the approach below with all of you, so any feedback is very welcome.
Why Localize the Compiler Diagnostic Messages?
Diagnostics play a vital role in the experience of using a programming language. Developer productivity depends on the compiler giving proper guidance through its messages in any situation, especially for incomplete or invalid code. Currently, diagnostic messages are only available in English, which limits their usefulness to anybody not proficient enough in the language.
As a step towards the goal of making the Swift programming language more accessible to non-English speakers, this proposal extends the Swift compiler to produce localized diagnostic messages.
Goal
Currently, the Swift compiler's diagnostic definitions are hard-coded in .def files (e.g. DiagnosticsSema.def), which has several disadvantages:
- The format is not extensible e.g. it's impossible to express localized messages.
- The format is not accessible.
- The format is hard for non-technical people to use.
- Files are pre-processed into another format while the compiler itself is built. Therefore you can’t dynamically change the diagnostic file.
The goal of this project is to change that so diagnostic messages could be stored in a more accessible format like YAML to make it easy for non-technical people to contribute new translations.
Why YAML?
We chose YAML mainly for two reasons:
- YAML is a human-friendly and more accessible format.
- LLVM already has facilities to read that format out of the box.
Agenda and Deliverables
Main Goals
Deliverable |
---|
Open a discussion on the forums about the project. |
Decide on a format for the .yaml file and its location in the repository and toolchain. |
Port diagnostics from .def files into a new YAML file. |
Refactor how diagnostic definitions are stored in DiagnosticEngine so it’s easier to verify that the new diagnostic format has all of the diagnostics mentioned in the source. |
Implement retrieval of individual diagnostics from the new diagnostic format. |
Add a flag (e.g. -locale) which switches the compiler to always use the new diagnostic format. |
Remove all of the old Diagnostics*.def files. |
Stretch Goals
Deliverable |
---|
Serialize YAML into a binary format before loading into the toolchain |
Figure out what the format might be that would be efficient for diagnostic retrieval |
Implement dual operation mode - directly from YAML or its binary form. |
Note: Implementing the dual operation mode will make it easy to modify diagnostics without needing to rebuild the compiler.
Following the schedule and deliverables above, I've been researching the following areas.
Description of the Change
1. YAML format
Every diagnostic in the current Diagnostic*.def files is structured like this:
*KIND*(*diagnostic_identifier*,*diagnostic_options*,
"*diagnostic_message*", (*diagnostic_signature*))
The suggested .yaml format is:
message_identifier: # diagnostic_identifier
  - kind: Error # The type/kind of the diagnostic, e.g. Error, Warning, Note, etc.
  - options: none
  - signature: null
  - languages: # The different languages this message is translated to
    - en: "Error"
    - fr: "Erreur"
Consider a diagnostic like the following, whose message contains placeholders for the arguments listed in its signature. This is currently handled by DiagnosticEngine, which replaces each %<num> with the actual argument when the diagnostic message is formed from an InFlightDiagnostic object. Therefore we won't need to handle this in YAML (e.g. by using YAML anchors); DiagnosticEngine will keep handling it as it does right now.
ERROR(could_not_find_enum_case,none,
"enum type %0 has no case %1; did you mean %2?",
(Type, DeclNameRef, DeclName))
Proposed YAML file:
could_not_find_enum_case:
  - kind: Error
  - options: none
  - signature:
    - Type
    - DeclNameRef
    - DeclName
  - languages:
    - en: "enum type %0 has no case %1; did you mean %2?"
    - fr: "...."
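To make the %<num> substitution concrete, here is a minimal sketch of the idea. It is not how DiagnosticEngine is actually implemented (the real formatting supports more than plain positional placeholders); the function name formatDiagnostic and the use of plain strings as arguments are assumptions made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: replaces each %<num> in `text` with args[num].
// A placeholder whose index is out of range is copied through unchanged.
std::string formatDiagnostic(const std::string &text,
                             const std::vector<std::string> &args) {
  std::string result;
  for (std::size_t i = 0; i < text.size(); ++i) {
    if (text[i] == '%' && i + 1 < text.size() &&
        std::isdigit(static_cast<unsigned char>(text[i + 1]))) {
      // Read the full decimal index that follows the '%'.
      std::size_t j = i + 1;
      std::size_t index = 0;
      while (j < text.size() &&
             std::isdigit(static_cast<unsigned char>(text[j]))) {
        index = index * 10 + static_cast<std::size_t>(text[j] - '0');
        ++j;
      }
      if (index < args.size()) {
        result += args[index];
        i = j - 1; // skip past the digits; the loop's ++i moves to j
        continue;
      }
    }
    result += text[i];
  }
  return result;
}
```

Because the substitution happens after the message text has been chosen, a translated string only needs to keep the same %0, %1, %2 placeholders; translators can reorder them freely within the sentence.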
1.1 New diagnostic format placement
First, in the repository, I will create a new subdirectory at include/swift/AST called Diagnostics and I'll put the Diagnostics.yaml file in it.
Second, I'll create a CMake file at include/swift/AST and it will handle copying the Diagnostics.yaml to its location in the toolchain.
swift_install_in_component(DIRECTORY Diagnostics
DESTINATION "share/"
COMPONENT compiler)
2. Approach to refactoring of the current format
Currently, the diagnostic messages are parsed from the Diagnostic*.def files into an array of strings (see DiagnosticEngine.cpp#L94-L101), which is then queried by position in DiagnosticEngine::diagnosticStringFor.
My first refactor for the DiagnosticEngine will be:
- Parsing the diagnostic message information from the Diagnostics.yaml file into a std::vector of objects.
- We can also iterate over the std::vector and fill diagnosticStrings[] with the diagnostic string messages.
The space complexity of this algorithm will be O(n), where n is the number of DiagnosticYAMLNode objects, for creating the std::vector<DiagnosticYAMLNode>, and the time complexity for parsing will be O(n).
If we choose to keep working with the same diagnosticStrings[] and not use the newly created std::vector<DiagnosticYAMLNode> directly, then creating this array will take an additional O(n) in space and time.
Please note that this is not the best solution, as the time and space complexity can still be improved.
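The second step of the refactor, filling a position-indexed array of messages from the parsed nodes, could look roughly like the sketch below. The fields of DiagnosticYAMLNode, the function name buildDiagnosticStrings, and the fallback to the "en" text when a translation is missing are all assumptions for illustration; the YAML parsing itself is left out.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory form of one YAML entry. The field names mirror the
// proposed file format and are assumptions, not existing compiler types.
struct DiagnosticYAMLNode {
  std::string identifier;
  std::string kind;
  std::vector<std::string> signature;
  std::map<std::string, std::string> languages; // locale -> message text
};

// Builds the array of message strings for one locale in a single pass over
// the nodes, keeping the nodes' order so lookups by position still work.
// Assumption: a missing translation falls back to "en", and the entry is
// left empty if that is missing as well.
std::vector<std::string>
buildDiagnosticStrings(const std::vector<DiagnosticYAMLNode> &nodes,
                       const std::string &locale) {
  std::vector<std::string> strings;
  strings.reserve(nodes.size());
  for (const DiagnosticYAMLNode &node : nodes) {
    auto it = node.languages.find(locale);
    if (it == node.languages.end())
      it = node.languages.find("en");
    strings.push_back(it != node.languages.end() ? it->second
                                                 : std::string());
  }
  return strings;
}
```

With a fixed, small number of languages per entry, the pass is linear in the number of diagnostics, which matches the extra O(n) cost described above.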
3. Retrieval of individual diagnostics
Since I will be using the same diagnosticStrings[] or something similar for storing the diagnostic messages, I'll use the current method for retrieving diagnostic messages, which is querying by position.
4. New frontend flag
I'll create a frontend flag, just like @owenv did with -debug-diagnostic-names, for using the newly created Diagnostics.yaml format.
5. Efficient diagnostic storage
Performance is not a concern for diagnostic messages right now because all of the strings are included with the compiler as a static collection. With YAML, however, a large project would involve multiple compiler invocations, and it might not be the best idea to re-parse the same YAML file over and over again.
A way to improve the time complexity is to split the Diagnostics.yaml file into multiple files, so that each file contains the diagnostics for one particular language. Then I'll serialize each file to an LLVM::BitStream file.
6. Efficient file format for the binary file
The time complexity for reading from the YAML file is O(n), and the time complexity for reading the LLVM::BitStream is also O(n). Therefore, we need to make a custom structure for the LLVM::BitStream file to improve the indexing time complexity.
We know that data inside LLVM::BitStream is stored in the form of blocks, which define regions of the file, and records, which contain data fields that can be up to 64 bits. Every block has a key field, which helps it to be recognized uniquely.
We can reduce the time to access a diagnostic message by using a data structure based on multilevel indexing. Therefore, I'll use a B+ Tree. Using a B+ Tree as the file structure for the LLVM::BitStream file will improve the lookup for a diagnostic message from O(n) to O(log(n)).
You can also read more about the B+ Tree implementation in my GSoC proposal on Google Docs.
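To illustrate the difference between scanning and indexed lookup, here is a small sketch that uses a sorted in-memory index with binary search. This is deliberately not a B+ Tree and says nothing about the actual on-disk layout; it is a simpler stand-in that shows the same O(log(n)) lookup behaviour. IndexEntry, its fields, and findIndexEntry are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical index record: where one diagnostic's message starts in the
// serialized file.
struct IndexEntry {
  std::uint32_t diagID;
  std::uint64_t offset;
};

// Looks up `diagID` in an index that must already be sorted by diagID.
// Returns a pointer to the matching entry, or nullptr if there is none.
// Binary search halves the candidate range on every step.
const IndexEntry *findIndexEntry(const std::vector<IndexEntry> &index,
                                 std::uint32_t diagID) {
  auto it = std::lower_bound(
      index.begin(), index.end(), diagID,
      [](const IndexEntry &entry, std::uint32_t id) {
        return entry.diagID < id;
      });
  if (it == index.end() || it->diagID != diagID)
    return nullptr;
  return &*it;
}
```

A B+ Tree generalizes this idea to several levels of index nodes, which matters when the index itself should be read from the file piece by piece instead of being loaded in full.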
7. Implement dual operation mode
I'm still trying to figure this one out.
GSoC Original Proposal
For more details on the solution and the performance cost, make sure to check out the GSoC proposal on Google Docs.