Python 3 support for building Swift Toolchain

My knowledge of Python is rusty, but have you tried...

stdin = sys.stdin.read().encode().decode()
# ...
stdout, stderr = p.communicate(stdin.encode())

Unfortunately, that did not solve the problem either. However, it has illuminated a possible source of the discrepancy. I also noticed that GYB was working when compiling regular source and (intermittently) failing while compiling tests. So I put a print("STDIN: {}".format(sys.stdin.encoding)) in GYB. I noticed that for all calls in regular source it is UTF-8 for failing sources in test it is ANSI_X3.4-1968.

I am starting to think the real underlying issue is that something about the test structure is stripping the encoding information from Python.

If I recall correctly, the test infrastructure strips out all environment variables that aren't explicitly allowed. Maybe one of those is a locale variable that the test runner should set to something consistent?

That definitely turned out to be the case. Lit, which launches (all of?) the tests, is for sure stripping the locale/encoding information which means Python 3 falls back to ASCII.

I spent way more time on this than I care to admit but I think I arrived at a lit.cfg that actually allows Python 3 to see expected encodings.

config.environment['PYTHONIOENCODING'] = 'UTF-8'
for key in set(list(["LANG", "LC_ALL"])).intersection(os.environ):
    config.environment[key] = os.environ[key]

While that works, I am uncomfortable suggesting it because I don't know enough about the implications on Lit.

In the end, feel more comfortable with this:

diff --git a/utils/PathSanitizingFileCheck b/utils/PathSanitizingFileCheck
index 7cad167eb7f..5a521df2c8c 100755
--- a/utils/PathSanitizingFileCheck
+++ b/utils/PathSanitizingFileCheck
@@ -12,6 +12,7 @@
 from __future__ import print_function
 
 import argparse
+import io
 import re
 import subprocess
 import sys
@@ -72,7 +73,7 @@ constants.""")
     else:
         slashes_re = r'/'
 
-    stdin = sys.stdin.read()
+    stdin = io.open(sys.stdin.fileno(), 'r', encoding="utf-8").read()
 
     for s in args.sanitize_strings:
         replacement, pattern = s.split('=', 1)
@@ -83,13 +84,14 @@ constants.""")
                        replacement,
                        stdin)
 
     if args.dry_run:
         print(stdin)
         return 0
     else:
         p = subprocess.Popen(
             [args.file_check_path] + unknown_args, stdin=subprocess.PIPE)
-        stdout, stderr = p.communicate(stdin)
+        stdout, stderr = p.communicate(stdin.encode(encoding="utf-8"))
         if stdout is not None:
             print(stdout)
         if stderr is not None:
-- 
2.24.1

It is more or less the same thing, it is 2/3 compatible and the change is localized to just this file. It should not have side-effects on the entire test suite. Failing tests: 35. Closer still.

So, wait, can we do a status check? I am able to build Swift from master with Python 3:

[rolson@tachobuilderf28 bin]$ which python
/usr/bin/python
[rolson@tachobuilderf28 bin]$ python
Python 3.7.6 (default, Jan 30 2020, 09:44:41)
[GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
[rolson@tachobuilderf28 bin]$ ./swift --version
Swift version 5.3-dev (LLVM 55ab0064a0, Swift 5243d3c809)
Target: x86_64-unknown-linux-gnu
[rolson@tachobuilderf28 bin]$

This is straight from master from this morning; I'm wondering what is missing or what still needs to be done.

The minimal change seems reasonable. If we were going to set LANG and LC_ALL explicitly, I'd rather pick something for all systems rather than inherit the system locale.

(I'm not sure why the existing behavior would be different on different systems though, unless different distros compile Python 3 with different defaults?)

@Ron_Olson So I think, and you'll have to correct me if I am wrong, the system that made that build has Python 2 on it as well. The reason why I point that out is that there have been changes to the CMake configurations to explicitly find Python 2 and use it for tools that still require Python 2. Then explicitly find Python 3 and use it for tools that are now working with Python 3.

My guess is that if you build with CMake with verbose output the build/test tools that explicitly require Python, e.g., GYB and Lit, probably refer to /usr/bin/python2.7 or something similar and not /usr/bin/python (as they had been previously).

In my build environment when I build master (i.e., find_package(Python2 COMPONENTS Interpreter REQUIRED) ) it says: '/usr/bin/python2.7' /home/rlovelett/Source/swift-source/swift/utils/gyb.py .... If I patch CMake with find_package(Python3 COMPONENTS Interpreter REQUIRED) I get '/usr/bin/python3.8' /home/rlovelett/Source/swift-source/swift/utils/gyb.py ....

All that to say, it probably has less to do with the Python 3 incompatibilities having been fixed and more to do with tools that require Python 2 are now correctly using it.

The specific tool I'm talking about here, PathSanitizingFileCheck, as far as I know, only gets called in the test suite by Lit. So to experience this particular build issue you'd have to have (1) run the test suite (2) explicitly force CMake to use Python 3.

1 Like

@rlovelett - Lol, you're right; even though I had, at some point, explicitly removed all of Python 2, something must have added it back as a dependency. Okay, removed, and now back to patching. :slight_smile:

1 Like

The minimal change seems reasonable. If we were going to set LANG and LC_ALL explicitly, I'd rather pick something for all systems rather than inherit the system locale.

I agree with that too. I just do not believe I am qualified to judge what it should be. So my hack was to copy what my system encoding says.

(I'm not sure why the existing behavior would be different on different systems though, unless different distros compile Python 3 with different defaults?)

Let me preface this by saying I am not a Python expert. In fact, the only direct experience I have ever had writing/debugging Python has been in context of compiling Swift.

My understanding after reading a few documents, including PEP 538 and the locale module documentation, and running a few tests of my own. Python 3 is very consistent on encoding, strings are ASCII unless changed by the environment or some runtime call.

As far as I can tell, this was the case in Python 2. The only difference was that in Python 2 strings were also binary sequence types as well. In Python 3 they are one or the other. As a result things that were okay before, e.g., a byte not handled by the default encoding scheme, could still be processed as bytes so long as it was never "encoded".

The interchangeability of string/bytes in Python 2 is (ab)used in many testing utilities. For instance PathSanitizingFileCheck, I've since realized that the patch above is not right for all tests because in some cases PathSanitizingFileCheck is purposely sent invalid UTF-8 characters. In cases like that at first glance the syntax seemed to indicate things were strings; upon running and fixes more tests it became clear the code was really treating it as bytes.

With Python 3 changes have to be made to make it clear that it is either bytes or a string.

Or that, is what I understand as of right now, with 15 more tests left to fix. I imagine I'll learn more as I go.

From the setlocale(3) man page on my Linux box: The locale "C" or "POSIX" is a portable locale; it exists on all conforming systems.