z/OS, Swift, and encodings

I've been part of the team at IBM that has been porting Swift to z/OS. Although we have a working version of the compiler and runtime, we’ve had to implement some horrible hacks to get there, and we’re now in the midst of trying to “right our technical wrongs”. If we don’t, there is no way our code can be pushed upstream and we will forever be downstream consumers, complete with constant merge pains.

Some backstory, quickly. z/OS is the operating system on IBM’s mainframe systems (“z Systems” or simply “z”). We had to port LLVM and Clang (obviously) and implement the backend for the z architecture, although most of that work had already been done by a team that ported LLVM/Clang to Linux on z. The LLVM backend is where most of our code changes have been applied. The changes in Swift are small in comparison.

However, there is one massive elephant in the room: EBCDIC. The native encoding on mainframes is EBCDIC. In order to make any progress at the beginning of the project, we took the huge, and arguably necessary, step of changing the internal representation of strings, symbols, and the like to EBCDIC. Our hacks revolve around this. We kept Swift strings themselves in Unicode, but assumed all Swift source code (and LLVM IR, SIL, etc.) was EBCDIC and converted accordingly. It was ugly, but it worked. Obviously this violates the Swift spec and is no good if you attempt to pull in code from other sources, say, via the package manager.

We are now working to eliminate this hack. As such, we are starting from the assumption that all (textual) input must be converted to UTF-8. It’s “UTF-8 inside”. Any conversions to other codesets are done at system boundaries. Input must come with a codeset and be converted if necessary before being processed, and output may be converted if required (such as messages to stderr). We are working on a solution now that is minimally invasive and will have little to no performance impact on other platforms.

This was the most reasonable approach we could come up with. The alternative, demanding that all input and output be Unicode, would mean that anyone editing files on a z system has a hard time. It would also make development very difficult and tedious, for example, when reading intermediate file output.
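To make the “UTF-8 inside, convert at the boundary” idea concrete, here is a minimal sketch of an input boundary. Every name in it (Codeset, ebcdicToAscii, toInternalUtf8, demoBoundaryConversion) is hypothetical and not taken from the port. It assumes EBCDIC code page 1047 and only maps space, digits, and unaccented Latin letters; a real boundary would presumably hand the full codeset to the platform's converter rather than a hand-written table.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical names throughout; none of this is from the actual port.
enum class Codeset { UTF8, EBCDIC1047 };

// Maps one EBCDIC (code page 1047) byte to its ASCII/UTF-8 value.
// Only space, digits, and unaccented Latin letters are covered.
// The values are numeric on purpose: a character literal such as 'A'
// would itself be EBCDIC when built with an EBCDIC-only compiler.
int ebcdicToAscii(unsigned char b) {
  if (b == 0x40) return 0x20;                            // space
  if (b >= 0xF0 && b <= 0xF9) return 0x30 + (b - 0xF0);  // 0-9
  if (b >= 0xC1 && b <= 0xC9) return 0x41 + (b - 0xC1);  // A-I
  if (b >= 0xD1 && b <= 0xD9) return 0x4A + (b - 0xD1);  // J-R
  if (b >= 0xE2 && b <= 0xE9) return 0x53 + (b - 0xE2);  // S-Z
  if (b >= 0x81 && b <= 0x89) return 0x61 + (b - 0x81);  // a-i
  if (b >= 0x91 && b <= 0x99) return 0x6A + (b - 0x91);  // j-r
  if (b >= 0xA2 && b <= 0xA9) return 0x73 + (b - 0xA2);  // s-z
  return -1;                                             // not in this sketch
}

// Input boundary: everything past this call is treated as UTF-8.
std::string toInternalUtf8(const std::string& bytes, Codeset from) {
  if (from == Codeset::UTF8) return bytes;  // already in the internal form
  std::string out;
  out.reserve(bytes.size());
  for (unsigned char b : bytes) {
    int value = ebcdicToAscii(b);
    if (value < 0) throw std::invalid_argument("byte outside the sketch's subset");
    out.push_back(static_cast<char>(value));
  }
  return out;
}

// Usage: the EBCDIC bytes for "SWIFT" become the UTF-8 bytes for "SWIFT".
void demoBoundaryConversion() {
  const std::string ebcdic = "\xE2\xE6\xC9\xC6\xE3";
  assert(toInternalUtf8(ebcdic, Codeset::EBCDIC1047) == "\x53\x57\x49\x46\x54");
}
```

The point of the shape is that the codeset travels with the input and is consumed exactly once, so nothing downstream has to ask which encoding it is looking at.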

All this leads me to some questions/points.

1) Does the skeletal outline provided above seem reasonable to others? Are we missing something really important?

2) String and character literals in C++ source code are one of our biggest issues. The only C++11 compliant compiler for z/OS is an internal version of IBM’s XL C/C++ compiler, and it currently only handles EBCDIC. This means the literals in C++ source end up as EBCDIC. If the input that Swift processes is converted to UTF-8, then comparisons against such literals will fail. The solution we like the most is to use the ‘u8’ prefix (string literals since C++11, character literals since C++17) on all string and character literals. It would be a huge change, but it makes the encoding of literals explicit and involves no extra build configuration. Without the prefix, we have to resort to a lot of build hackery by defining our own pre-processor. If anyone has any ideas or tools that could help in this regard, we’d appreciate some input.

3) Obviously this is not limited to Swift code; we have to touch LLVM and Clang libraries. Are those mailing lists better places to discuss this?
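On point 2, a small sketch of what a ‘u8’ comparison might look like. The helper names (equalsUtf8, isFuncKeyword, demoU8Literal) are made up for illustration and are not existing Swift or LLVM functions. The helper is templated on the character type because a u8 string literal is an array of const char before C++20 and of const char8_t from C++20 on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: compares UTF-8 input bytes with a u8 string literal.
// N counts the literal's terminating NUL, hence N - 1 below.
template <typename CharT, std::size_t N>
bool equalsUtf8(const std::string& input, const CharT (&literal)[N]) {
  static_assert(sizeof(CharT) == 1, "expects a narrow or u8 string literal");
  return input.size() == N - 1 &&
         std::memcmp(input.data(), literal, N - 1) == 0;
}

// The u8 prefix pins the literal's bytes to UTF-8 (0x66 0x75 0x6E 0x63),
// whatever the compiler's native execution character set is.
bool isFuncKeyword(const std::string& tokenUtf8) {
  return equalsUtf8(tokenUtf8, u8"func");
}

// Usage: UTF-8 input matches; the EBCDIC 1047 bytes for "func" do not.
void demoU8Literal() {
  assert(isFuncKeyword(std::string("\x66\x75\x6E\x63")));
  assert(!isFuncKeyword(std::string("\x86\xA4\x95\x83")));
}
```

With a plain "func" literal and an EBCDIC-only compiler, the two results in demoU8Literal would presumably be reversed, which is exactly the failure described in point 2.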

-- Geoff