Bug/limitation in clang importer with typecast #defines

carlos42421 · August 15, 2020, 12:36pm

It looks to me like there is a limitation in the clang importer when importing simple constants defined like #define X_STD_SIZE 78.

If they are typecast like this...

#define X_STD_SIZE (char)78

...then it works fine but if they are typecast like this...

#define X_STD_SIZE (unsigned char)78

...then it does not work, the X_STD_SIZE constant is not imported and is not available to Swift.

I have tracked this down to code in lib/ClangImporter/ImportMacro.cpp the importMacro function (line 326 in my code).

This code is the problem...

  // Handle tokens starting with a type cast
  bool castTypeIsId = false;
  if (numTokens > 3 &&
      tokenI[0].is(clang::tok::l_paren) &&
      (tokenI[1].is(clang::tok::identifier) ||
        impl.getClangSema().isSimpleTypeSpecifier(tokenI[1].getKind())) &&
      tokenI[2].is(clang::tok::r_paren)) {
    if (!castType.isNull()) {
      // this is a nested cast
      return nullptr;
    }

    if (tokenI[1].is(clang::tok::identifier)) {
      auto identifierInfo = tokenI[1].getIdentifierInfo();
      if (identifierInfo->isStr("id")) {
        castTypeIsId = true;
      }
      auto identifierName = identifierInfo->getName();
      auto &identifier = impl.getClangASTContext().Idents.get(identifierName);

      clang::sema::DelayedDiagnosticPool diagPool{
          impl.getClangSema().DelayedDiagnostics.getCurrentPool()};
      auto diagState = impl.getClangSema().DelayedDiagnostics.push(diagPool);
      auto parsedType = impl.getClangSema().getTypeName(identifier,
                                                        clang::SourceLocation(),
                                                        /*scope*/nullptr);
      impl.getClangSema().DelayedDiagnostics.popWithoutEmitting(diagState);

      if (parsedType && diagPool.empty()) {
        castType = parsedType.get();
      } else {
        return nullptr;
      }
      if (!castType->isBuiltinType() && !castTypeIsId) {
        return nullptr;
      }
    } else {
      auto builtinType = builtinTypeForToken(tokenI[1],
                                             impl.getClangASTContext());
      if (builtinType) {
        castType = builtinType.getValue();
      } else {
        return nullptr;
      }
    }
    tokenI += 3;
    numTokens -= 3;
  }

The problem is fairly obvious when you see it.

In the case when you're casting to char, there are four tokens in total, '(', 'char', ')', '78' and it all works. When you're casting to unsigned char, there are five tokens, '(', 'unsigned', 'char', ')', '78', so token 3 is 'char', not ')' and the test fails.

In order to fix this, the test needs to be more general, like...

  if (numTokens > 3 &&
      tokenI[0].is(clang::tok::l_paren) &&
      (tokenI[1].is(clang::tok::identifier) ||
        impl.getClangSema().isSimpleTypeSpecifier(tokenI[1].getKind())) &&
      (tokenI[2].is(clang::tok::r_paren) || tokenE[-2].is(clang::tok::r_paren))) {

...but I'm not sure how this code works and what is best.

What do people think?

jrose · August 15, 2020, 5:14pm

Yeah, the existing code is absolutely an extremely simple pattern-match. I think we probably don't want to go super-general here (unsigned long int, etc), but matching unsigned followed by a single identifier and a close-paren seems reasonable.

at least 3 tokens: "(", identifier | simple-type, ")"
at least 4 tokens: "(", "unsigned", identifier | simple-type, ")"

There is an additional small concern that successfully importing more macros into Swift could actually break code, since it's going to introduce new names. People may have worked around the absence of these macros by adding new static properties, or something similar. So someone from the core team would probably have to at least look at such a change, though I wouldn't expect them to put it through the proposal process.

carlos42421 · August 15, 2020, 5:18pm

That sounds sensible.

For me personally it would be a win to allow unsigned.

I can create a PR, but I'm not great with this area of code as I've got absolutely zero clang codebase experience. Can you make a suggestion what I might add to the code?

jrose · August 15, 2020, 5:36pm

I think my approach would be

factor the condition of that if out into a separate function that returns an enum with three cases: "not a cast", "cast to type", "cast to unsigned version of type".
at the end of the if, if we're doing a cast-to-unsigned, use the Clang AST context's getCorrespondingUnsignedType to, well, get the unsigned type.
make sure the token iterator is advanced correctly
add tests alongside the other macros-with-casts tests

I think this code already handles unsigned and uint8_t and such, so new tests are really about the syntactic form you're adding (which, by the way, should support things like unsigned int8_t).

Does that feel simple enough to tackle?

carlos42421 · August 15, 2020, 5:48pm

Yes, that makes sense. I started hacking something and made a lot of the same decisions you suggest but it wasn't as elegant and I had no idea about getCorrespondingUnsignedType in clang... I know nothing about clang's internals. Shows it was worth asking!

Oh, and of course I didn't think about adding tests. Which would have (rightly) been an instant PR bounce.

My code is currently based on the release/5.3. I assume I'll have to target any PR to HEAD? I don't know how different they are now?

jrose · August 15, 2020, 5:51pm

I would expect this code to be pretty similar, so retargeting shouldn't be too much of a problem. It's not like macros are drastically changing these days. :-)

benrimmington · August 16, 2020, 8:24pm

Would it be possible to use the constexpr interpreter, to evaluate complex macros?

Wouldn't the new imported macros be shadowed, by the old workaround properties?

jrose · August 17, 2020, 1:56am

No and yes. Macros are just bags of tokens, so Clang isn’t set up to interpret them. But in theory it would be possible to take an argument-less macro, expand it into a function, try to constant-evaluate it, and see if there are any errors. The trouble is…that’d be a lot of work, and it would still come out finicky and failure-prone, and it would be hard to explain what’s supported and what’s not. So it might actually be best to leave things as “simple macros only”, encouraging people to use static constants and functions instead.

Hopefully yes, but not if the compiler picks a different type for the macro.

carlos42421 · August 17, 2020, 8:54am

This is the kind of patch I'm looking at, based on @jrose's ideas and my initial thoughts...

diff --git a/lib/ClangImporter/ImportMacro.cpp b/lib/ClangImporter/ImportMacro.cpp
index d4befac1afc..b8fa15efa27 100644
--- a/lib/ClangImporter/ImportMacro.cpp
+++ b/lib/ClangImporter/ImportMacro.cpp
@@ -322,6 +322,33 @@ static Optional<std::pair<llvm::APSInt, Type>>
   return None;
 }
 
+enum class TokenTypeCastKind {
+  NoTypeCast,  // simple token
+  TypeCast,    // e.g. (char)77
+  UnsignedTypeCast  // e.g. (unsigned char)77
+};
+
+static TokenTypeCastKind tokenHasTypeCast(
+  clang::Sema& sema,
+  const clang::MacroInfo::tokens_iterator &tokenI,
+  unsigned numTokens) {
+
+  if (numTokens > 3 &&
+      tokenI[0].is(clang::tok::l_paren) &&
+      (tokenI[1].is(clang::tok::identifier) ||
+        sema.isSimpleTypeSpecifier(tokenI[1].getKind())) &&
+      tokenI[2].is(clang::tok::r_paren)) {
+    return TokenTypeCastKind::TypeCast;
+  } else if (numTokens > 4 &&
+      tokenI[0].is(clang::tok::l_paren) &&
+      tokenI[1].is(clang::tok::kw_unsigned) &&
+      (sema.isSimpleTypeSpecifier(tokenI[2].getKind())) &&
+      tokenI[3].is(clang::tok::r_paren)) {
+    return TokenTypeCastKind::UnsignedTypeCast;
+  } else {
+    return TokenTypeCastKind::NoTypeCast;
+  }
+}
 
 static ValueDecl *importMacro(ClangImporter::Implementation &impl,
                               DeclContext *DC,
@@ -345,11 +372,8 @@ static ValueDecl *importMacro(ClangImporter::Implementation &impl,
 
   // Handle tokens starting with a type cast
   bool castTypeIsId = false;
-  if (numTokens > 3 &&
-      tokenI[0].is(clang::tok::l_paren) &&
-      (tokenI[1].is(clang::tok::identifier) ||
-        impl.getClangSema().isSimpleTypeSpecifier(tokenI[1].getKind())) &&
-      tokenI[2].is(clang::tok::r_paren)) {
+  auto typeCastKind = tokenHasTypeCast(impl.getClangSema(), tokenI, numTokens);
+  if (typeCastKind != TokenTypeCastKind::NoTypeCast) {
     if (!castType.isNull()) {
       // this is a nested cast
       return nullptr;
@@ -380,10 +404,17 @@ static ValueDecl *importMacro(ClangImporter::Implementation &impl,
         return nullptr;
       }
     } else {
-      auto builtinType = builtinTypeForToken(tokenI[1],
+      bool isUnsigned = typeCastKind == TokenTypeCastKind::UnsignedTypeCast;
+      auto builtinType = builtinTypeForToken(tokenI[(isUnsigned?2:1)],
                                              impl.getClangASTContext());
       if (builtinType) {
         castType = builtinType.getValue();
+        if (isUnsigned) {
+          castType = impl.getClangASTContext().
+            getCorrespondingUnsignedType(castType);
+            tokenI += 1;
+            numTokens -= 1;
+        }
       } else {
         return nullptr;
       }

Even if it's the right approach in principle, I'm sure the code needs lots of work. I've never worked in this area and I'm not very good at C++ coding. Plus obviously it needs unit tests.

When I did a simple run through with a project with a bridging header, it seemed to work and seemed to interpret these macros correctly. But is now showing quirks that I'm checking.

If you guys think it looks roughly right, I'll raise a PR and get it into a more formal review.