Hello everyone, I hope you are doing well and staying safe :)
My name is Hassan ElDesouky, and I'm a third-year CS student at a university in Egypt. I have a background in iOS development and competitive programming. I also have a blog on Medium and a personal website, which you can check out at heldesouky.xyz. Finally, I'm an active member of the Swift community in Egypt as part of the SwiftCairo meetup.
This post is intended as a conversation starter to discuss the approach below with all of you, so please share any feedback you have.
Why Localize the Compiler Diagnostic Messages?
Diagnostics play a vital role in the experience of using a programming language. It’s essential for developer productivity that the compiler can produce proper guidance through its messages in any situation, especially for incomplete or invalid code. Currently, diagnostic messages are only available in English, which limits their usefulness to anybody who is not proficient enough in the language.
As a step towards the goal of making the Swift programming language more accessible to non-English speakers, this proposal extends the Swift compiler to produce localized diagnostic messages.
Currently, the Swift compiler diagnostic definitions are hard-coded in .def files (e.g. DiagnosticsSema.def), which has several disadvantages:
- The format is not extensible e.g. it's impossible to express localized messages.
- The format is not accessible.
- The format is hard for non-technical people to use.
- Files are pre-processed into another format while the compiler itself is built. Therefore, you can’t dynamically change the diagnostic file.
The goal of this project is to change that so diagnostic messages can be stored in a more accessible format like YAML, making it easy for non-technical people to contribute new translations.
I chose YAML mainly for two reasons:
- YAML is a human-friendly and more accessible format.
- LLVM already has facilities to read that format out of the box.
Agenda and Deliverables
|Open a discussion on the forums about the project.|
|Decide on a format of .yaml file and its location in repository and toolchain.|
|Port diagnostics from
|Refactor how diagnostic definitions are stored in
|Implement retrieval of individual diagnostics from the new diagnostic format.|
|Add a flag (e.g.
|Remove all of the old
|Figure out what the format might be that would be efficient for diagnostic retrieval|
|Implement dual operation mode - directly from
Note: Implementing the dual operation mode will make it easy to modify diagnostics without needing to rebuild the compiler.
Following the schedule and deliverables above, I've been researching the following areas.
Description of the Change
Every diagnostic in the current Diagnostic*.def files is structured like this:
KIND(diagnostic_identifier, diagnostic_options, "diagnostic_message", (diagnostic_signature))
The proposed .yaml format is:
message_identifier: # diagnostic_identifier
  - kind: Error # The type/kind of the diagnostic (e.g. Error, Warning, Note, etc.)
  - options: none
  - signature: null
  - languages: # The different languages this message is translated to
      - en: "Error"
      - fr: "Erreur"
Some diagnostics, like the one below, contain placeholders that are filled in with the values listed in the signature. Currently, DiagnosticEngine replaces each %<num> placeholder with the actual value when the diagnostic message is formed from the InFlightDiagnostic object. Therefore, we won't need to handle this in YAML (e.g. by using YAML anchors); DiagnosticEngine will keep handling it as it does right now.
ERROR(could_not_find_enum_case,none, "enum type %0 has no case %1; did you mean %2?", (Type, DeclNameRef, DeclName))
could_not_find_enum_case:
  - kind: Error
  - options: none
  - signature:
      - Type
      - DeclNameRef
      - DeclName
  - languages:
      - en: "enum type %0 has no case %1; did you mean %2?"
      - fr: "...."
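To make the placeholder idea concrete, here is a minimal sketch of positional substitution. It is not the actual DiagnosticEngine code: `formatDiagnostic` is a hypothetical helper, the arguments are plain strings instead of typed values, and the real engine supports more than plain `%<num>` placeholders.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a Swift compiler API): replaces every %<num>
// in `text` with args[num]. Placeholders whose index is out of range are
// copied through unchanged.
std::string formatDiagnostic(const std::string &text,
                             const std::vector<std::string> &args) {
  std::string out;
  for (std::size_t i = 0; i < text.size(); ++i) {
    const bool startsPlaceholder =
        text[i] == '%' && i + 1 < text.size() &&
        std::isdigit(static_cast<unsigned char>(text[i + 1]));
    if (!startsPlaceholder) {
      out += text[i];
      continue;
    }
    // Read all digits that follow the '%'.
    std::size_t j = i + 1;
    std::size_t index = 0;
    while (j < text.size() &&
           std::isdigit(static_cast<unsigned char>(text[j]))) {
      index = index * 10 + static_cast<std::size_t>(text[j] - '0');
      ++j;
    }
    if (index < args.size())
      out += args[index];
    else
      out.append(text, i, j - i); // keep the unknown placeholder verbatim
    i = j - 1; // the loop's ++i then moves past the last digit
  }
  return out;
}
```

Because the placeholders are numbered, a translated message is free to use them in a different order than the English one, and the same argument list still works.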
1.1 New diagnostic format placement
First, in the repository, I will create a new subdirectory, Diagnostics, and put the Diagnostics.yaml file in it.
Second, I'll create a CMake file at include/swift/AST that will handle copying Diagnostics.yaml to its location in the toolchain.
swift_install_in_component(DIRECTORY Diagnostics DESTINATION "share/" COMPONENT compiler)
2. Approach to refactoring of the current format
Currently, the diagnostic messages are parsed from the Diagnostic*.def files into an array of strings (see DiagnosticEngine.cpp#L94-L101), which is then queried by position in
My first refactor for the
DiagnosticEngine will be:
- Parsing the diagnostic message information from the
Diagnostics.yaml file to an
- We can also iterate over the
std::vector and fill the
diagnosticStrings with the diagnostic
The space complexity of this algorithm will be O(DiagnosticYAMLNode) for creating the std::vector<DiagnosticYAMLNode>, and for parsing the time complexity will be
If we choose to keep working with the same diagnosticStrings instead of using the newly created std::vector<DiagnosticYAMLNode>, then creating this array will take an additional O(n) space and time.
Please note that this is not the best solution, as the time and space complexity can be improved.
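As a rough illustration of the second step, the sketch below fills a `diagnosticStrings`-style table from an already parsed `std::vector<DiagnosticYAMLNode>`. Everything in it is an assumption made for illustration: the fields of `DiagnosticYAMLNode`, the `buildDiagnosticStrings` helper, and the fallback to English when a translation is missing. The YAML parsing itself is left out.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory form of one entry in Diagnostics.yaml.
// The field names mirror the YAML keys shown earlier.
struct DiagnosticYAMLNode {
  std::string identifier;
  std::string kind;
  std::vector<std::string> signature;
  std::map<std::string, std::string> languages; // language code -> message
};

// Builds a position-indexed table of messages for one language.
// Entry i of the result belongs to nodes[i], so lookups by position
// keep working. Assumption: fall back to "en" when there is no
// translation, and to an empty string when even "en" is missing.
std::vector<std::string>
buildDiagnosticStrings(const std::vector<DiagnosticYAMLNode> &nodes,
                       const std::string &language) {
  std::vector<std::string> diagnosticStrings;
  diagnosticStrings.reserve(nodes.size());
  for (const DiagnosticYAMLNode &node : nodes) {
    auto it = node.languages.find(language);
    if (it == node.languages.end())
      it = node.languages.find("en");
    diagnosticStrings.push_back(it != node.languages.end() ? it->second
                                                           : std::string());
  }
  return diagnosticStrings;
}
```

This makes the extra cost mentioned above visible: one pass over the n nodes and one additional array of n strings.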
3. Retrieval of individual diagnostics methods.
Since I will be using the same diagnosticStrings or something similar for storing the diagnostic messages, I'll keep the current method for retrieving diagnostic messages, which is querying by position.
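For readers unfamiliar with the .def approach, querying by position can be pictured with the sketch below. It is a simplified stand-in, not the code in DiagnosticEngine.cpp: the entries live in a list macro instead of an included .def file so the example is self-contained, and `SKETCH_DIAGNOSTICS`, `DiagID`, `diagnosticText`, and the second diagnostic are hypothetical.

```cpp
#include <cstddef>

// Stand-in for the contents of a Diagnostic*.def file.
#define SKETCH_DIAGNOSTICS(DIAG)                                            \
  DIAG(could_not_find_enum_case, none,                                      \
       "enum type %0 has no case %1; did you mean %2?",                     \
       (Type, DeclNameRef, DeclName))                                       \
  DIAG(sketch_second_diagnostic, none, "hypothetical message", ())

// First expansion: one enumerator per diagnostic, in definition order.
enum class DiagID : std::size_t {
#define MAKE_ID(ID, Options, Text, Signature) ID,
  SKETCH_DIAGNOSTICS(MAKE_ID)
#undef MAKE_ID
  NumDiags
};

// Second expansion: the message texts, in the same order as the enumerators.
static const char *const diagnosticStrings[] = {
#define MAKE_TEXT(ID, Options, Text, Signature) Text,
    SKETCH_DIAGNOSTICS(MAKE_TEXT)
#undef MAKE_TEXT
};

// Querying by position: the identifier's numeric value is the array index.
const char *diagnosticText(DiagID id) {
  return diagnosticStrings[static_cast<std::size_t>(id)];
}
```

The lookup is a single array access, which is why any replacement table, whatever it is loaded from, needs to keep its entries in the same order as the identifiers.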
4. New frontend flag
5. Efficient diagnostic storage
Performance is not a concern for diagnostic messages right now because all of the strings are included with the compiler as a static collection. Building a large project, however, involves many compiler invocations, and it might not be the best idea to re-parse the same YAML file over and over again.
One way to improve the time complexity is to split the Diagnostics.yaml file into multiple files, where each file holds the diagnostics for one particular language. Then I'll serialize each file to an LLVM::BitStream file.
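The splitting step could look roughly like the sketch below, which turns one combined list into one position-aligned table per language. `SketchDiagnostic` and `splitByLanguage` are hypothetical names, and the serialization to a bitstream is deliberately left out.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical combined entry: one diagnostic with all of its translations.
struct SketchDiagnostic {
  std::string identifier;
  std::map<std::string, std::string> languages; // language code -> message
};

// Splits the combined list into one table per language. Every table has
// all.size() entries, so position i means the same diagnostic in every
// language. A missing translation is left as an empty string.
std::map<std::string, std::vector<std::string>>
splitByLanguage(const std::vector<SketchDiagnostic> &all) {
  std::map<std::string, std::vector<std::string>> perLanguage;
  for (std::size_t i = 0; i < all.size(); ++i) {
    for (const auto &entry : all[i].languages) {
      std::vector<std::string> &table = perLanguage[entry.first];
      table.resize(all.size()); // no-op once the table has full size
      table[i] = entry.second;
    }
  }
  return perLanguage;
}
```

With this layout a compiler invocation only needs to load the table for the requested language instead of every translation.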
6. Efficient file format for the binary file
The time complexity for reading from the YAML file is O(n), and the time complexity for reading the LLVM::BitStream is also O(n). Therefore, we need to make a custom structure for the LLVM::BitStream file to improve the indexing time complexity.
We know that data inside an LLVM::BitStream is stored in the form of blocks, which define regions of the file, and records, which contain data fields that can be up to 64 bits wide. Every block has a key field that identifies it uniquely.
We can reduce the time needed to access a diagnostic message by using a data structure based on multilevel indexing. Therefore, I'll use a B+ Tree. Using a B+ Tree as the file structure for the
LLVM::BitStream file will improve the lookup for the diagnostic message from
You can read more about the B+ Tree implementation in my GSoC proposal on Google Docs.
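To show what multilevel indexing buys, here is a deliberately simplified sketch. It is not a B+ Tree and not the design from the proposal: it is a static two-level index kept in memory, where a small top level selects a block and only that block is searched. The class name `TwoLevelIndex`, the integer keys, and the block size parameter are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Static two-level index over records sorted by key.
class TwoLevelIndex {
public:
  // `sortedRecords` must be sorted by key, with unique keys.
  TwoLevelIndex(std::vector<std::pair<std::uint32_t, std::string>> sortedRecords,
                std::size_t recordsPerBlock)
      : records(std::move(sortedRecords)),
        blockSize(recordsPerBlock == 0 ? 1 : recordsPerBlock) {
    // Top level: the first key of every block.
    for (std::size_t i = 0; i < records.size(); i += blockSize)
      blockFirstKeys.push_back(records[i].first);
  }

  // Returns the message stored for `key`, or nullptr if the key is absent.
  const std::string *find(std::uint32_t key) const {
    // Level 1: binary search for the last block whose first key is <= key.
    auto next = std::upper_bound(blockFirstKeys.begin(),
                                 blockFirstKeys.end(), key);
    if (next == blockFirstKeys.begin())
      return nullptr; // key is smaller than every stored key
    const std::size_t block =
        static_cast<std::size_t>(next - blockFirstKeys.begin()) - 1;
    // Level 2: scan only the records of that one block.
    const std::size_t begin = block * blockSize;
    const std::size_t end = std::min(begin + blockSize, records.size());
    for (std::size_t i = begin; i < end; ++i)
      if (records[i].first == key)
        return &records[i].second;
    return nullptr;
  }

private:
  std::vector<std::pair<std::uint32_t, std::string>> records;
  std::size_t blockSize;
  std::vector<std::uint32_t> blockFirstKeys;
};
```

A lookup touches the top level plus a single block rather than every record. A B+ Tree applies the same idea recursively with more levels and also supports insertion, which this sketch does not attempt.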
7. Implement dual operation mode
Still trying to figure it out.
GSoC Original Proposal
For more details on the solution and the performance cost, make sure to check out the GSoC proposal on Google Docs.