Compiling 4.2 on armv7 (and i686)

Kaiede · September 18, 2018, 3:45pm

I'm attempting to build Swift 4.2 on armv7. The swift-4.2-RELEASE tag, specifically. Unfortunately, after fixing a couple of simple typecast issues, I'm hitting a rather gnarly issue when I get to linking Swift.o itself for the stdlib, where the optimizer in the built swift binary goes south:

swift: /home/pi/buildSwiftOnARM/llvm/include/llvm/ADT/Optional.h:160: T &&llvm::Optional<swift::ProjectionKind>::operator*() &&[T = swift::ProjectionKind]: Assertion 'hasVal' failed.

2. While running pass #418236 SILFunctionTransform "RedundantLoadElimination" on SILFunction "@$Ss18ReversedCollectionV8IteratorVyx_Gs0C8ProtocolsSt4next7ElementQzSgyFTWs16_UnmanagedStringVys6UInt16VG_Tgq5".
 for 'next()' at /home/pi/buildSwiftOnARM/swift/stdlib/public/core/Reverse.swift:91:19

Interestingly, the built swift can't demangle this symbol, which may be related to my problem. The Xcode 10 build can demangle it though: @generic specialization <preserving fragile attribute, Swift._UnmanagedString<Swift.UInt16>> of protocol witness for Swift.IteratorProtocol.next() -> A.Element? in conformance Swift.ReversedCollection<A>.Iterator : Swift.IteratorProtocol in Swift

Unfortunately, attempting to build swift in debug on this device appears to be out of the question if I want the LLVM pieces of the stack too. Building with --release-debuginfo will exhaust VM address space when linking libLTO in llvm, and probably others downstream of that one. I've already confirmed swap is not the limiting factor in this case.

I've been attempting to bisect the development history, but this is currently slow going as the change happened prior to the 4.2 branch being created. The first development snapshot (swift-4.2-DEVELOPMENT-SNAPSHOT-2018-04-23-a) has the bug as well. swift-4.1.3-RELEASE does not. Since I'm not really too familiar with the codebase, I'm not really sure where 4.1 forked off of master and stopped receiving regular merges. And I'm hitting a number of other breaks with the development snapshots which make it harder to track things back as it can take a day of effort just to wade through the breaks to find out if a snapshot tag is good or not.

I'm hoping someone might recognize this enough to help me figure out what window to look at in the master branch so I can save some time digging through the history. Or at the least can point me at some information that helps me understand when the version branches forked so I can hunt through the history more efficiently.

Joe_Groff · September 19, 2018, 8:48pm

I'm not sure whether our build script configures LLVM to link with gold, but that might be something to check. gold may scale better than BFD ld, but both tend to eat a ton of memory when linking LLVM with debug info, so you may want to look into cross-building from a more capable host.

Alternatively, it's likely that the issue you're running into is a general bit-width bug somewhere in the compiler implementation, so it might also be a worthwhile experiment to see whether building the compiler for i386 on an Intel machine hits the same issue.

As for where to bisect, we don't generally merge back to release branches, instead cherry-picking select fixes from master into release branches as needed, so wherever swift-4.1-branch most recently branched from master may serve as your bisection point. Note that Swift is fairly tightly coupled to LLVM and Clang revisions as well, so when bisecting, you will want to ensure you pick a contemporary revision of the swift-llvm and swift-clang repos to match the Swift revision you're building against.

Kaiede · September 20, 2018, 12:11am

Note that Swift is fairly tightly coupled to LLVM and Clang revisions as well, so when bisecting, you will want to ensure you pick a contemporary revision of the swift-llvm and swift-clang repos to match the Swift revision you're building against.

I've been leveraging update-checkout to help with this. Right now, my "bisecting" is based on development snapshot tags in master, which is painful since I'm running across many unrelated breaks while trying to track down this one. I did find a post about the last "re-merge" from master to the 4.1 branch around the start of last December, but the unrelated breaks with the Dec 5th snapshot make it hard to tell what's going on, and without a good list of fixes to re-cherry-pick on top of the snapshot, it's very slow going.

I'm not sure whether our build script configures LLVM to link with gold, but that might be something to check.

As far as I can tell, my local debug builds are already using ld.gold as the linker. So it's a good idea, but it doesn't help here.

Alternatively, it's likely that the issue you're running into is a general bit-width bug somewhere in the compiler implementation, so it might also be a worthwhile experiment to see whether building the compiler for i386 on an Intel machine hits the same issue.

Agreed, I'm looking to set up a VM to do just that a bit later. And I agree it's probably a bit-width issue somewhere (I've already built a patch for the most obvious stuff), since I'm also looking at Arm64 at the same time. Arm64 builds fine, but then hits SR-7441...

Joe_Groff · September 20, 2018, 7:38pm

Thanks to @blangmuir, I just learned of some additional LLVM build settings that might help with your resource issues. You might try installing lld and running the build script with --extra-cmake-options="-DLLVM_ENABLE_LLD=ON", which will tell LLVM and Swift to link using lld instead of gold and should be less resource hungry. Adding --extra-cmake-options="-DLLVM_PARALLEL_LINK_JOBS=1" also limits the allowed parallelism during link steps, which reduces the overall amount of memory used by parallel jobs and might be a good idea on an armv7 machine even with lld.

Kaiede · September 21, 2018, 12:23am

Thanks for the extra tips. I’m noticing that x86 has no community CI listed. Which kinda makes sense since I’ve hit a couple x86 specific issues using Debian Stretch (good analogue for Raspbian Stretch, which only has an x64 image available).

I think it’ll make sense to submit the patches back assuming someone else didn’t fix them in master already, but I’m also a bit surprised there isn’t a CI node for x86.

The upside is that x86 does fail with the same error. The downside is that I’m still facing a sea of random 32-bit breaks in master between Dec and April if I want to keep trolling the history for clues.

Kaiede · September 22, 2018, 10:18pm

Just wanted to say that I found that this build failure was caused by: [AST] NFC: Misc tiny fixes · apple/swift@7c096c9 · GitHub

Reverting it gets me the same break I see at master’s 2017-12-05 snapshot, which is around when 4.1 forked. So there’s at least one more fix I need to find.

Reverting this change probably isn’t the long-term fix, but it’s enough to get my rhythm going and chase down these other issues. Thanks for the help so far.

Chasing these down on an iMac VM with a 32-bit distro is much, much faster.

I just wish we were able to catch these 32-bit breaks sooner than 8 months afterwards.

jrose · September 24, 2018, 4:44pm

We don't have any production platforms that use the compiler in a 32-bit environment, but external CI jobs welcome! (I see there's an armv7 Linux bot there but it hasn't run in a month.)

Kaiede · September 24, 2018, 5:15pm

Yeah, I'm one of the folks interested enough to try to see if we can get armv7 and aarch64 caught up, but they've perpetually been behind, making the CI kinda useless. I finally just got 4.1.3 Foundation fixed up enough on armv7 to migrate over in the last couple weeks. It'll be rough trying to go straight for master without working through the source tree.

I'm thinking that an i686/x86 build host is probably useful to the armv7 folks as well, since it is an easier environment to debug in, builds much faster, and will catch many of the 32-bit issues that we are seeing on armv7.

Unfortunately, while reverting that change does change the error, I get the original error back if I try to fix the next error down the pipe. And while I can build a debug swift on i686, I'm having a heck of a time trying to understand the state of the compiler at the time of the assert (using gdb at least). lldb-3.8 is having issues launching the executable, but I can give lldb-6.0 a shot in case it works better. Perhaps there are some lldb extensions that can help digest the tail allocations that make the types super-complicated?

What I do know is that it's failing while creating a Projection for IndexAddrInst: the hasVal assertion fires inside the getKind() call made by assert(getKind() == ProjectionKind::Index);.

Kaiede · September 25, 2018, 2:10am

And to make things worse, if I take x86 back far enough (to the 4.1 branch), I start hitting x86-specific breaks.

Specifically, it looks like Float80 gets into a weird state because the backing type is assumed to be 16 byte wide with 16 byte alignment (Darwin size/alignment). However, on Linux, it should be 12 bytes wide with 16 byte alignment.

Things go south because the struct gets generated with the Darwin sizes, and LLVM says it should be the Linux sizes. The only thing I can think of is either Swift's ABI assumes Darwin-like behavior here, or it's asking Clang, and somehow asking Clang about Darwin?

Again, any sort of pointers on how to push forward would be helpful.

EDIT: Arg... I hate it when the right search in the code finally hits pay dirt after just about giving up. So yeah, my hunch that Swift assumes Darwin-like alignment/size for long doubles turned out to be right. With luck, I think 4.1.3 might just work on 32-bit desktop Linux with this set of changes. If it does, I'm probably going to get things cleaned up to submit the PR for that.

Kaiede · September 25, 2018, 4:37pm

Aha, so some progress here:

I've got 4.1 building on 32-bit x86 Debian up until link time for XCTest/SwiftPM. The Swift compiler is producing some ELF relocations in dynamic libraries (R_386_GOTOFF, in particular) that don't really work on 32-bit in the version of LLVM that is being built. GNU ld complains about its use for certain symbols in XCTest, while if I force ld.gold, SwiftPM has problems in ./lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp digesting the relocations.

There's an awful lot of patches involved to get here, but I'll start collating them and prepping them to be committed back to the various projects (swift, libdispatch, foundation, llbuild).

That said, since I really only need the stdlib / Foundation to build to get some rough idea of the state of 32-bit, I'm probably not going to go back and finish the 32-bit x86 work for the moment. Although it would be kinda cool to see Swift 4.1.3 working. At the very least, I don't think there's all that much hiding behind the ELF relocation issues. If there is another approach that doesn't require adding an implementation for GOT-based relocations to RuntimeDyldELF.cpp, I would be willing to give that a shot. I suspect it's emitting these because Swift hardcodes the Reloc model to be llvm::Reloc::PIC_?

So I'll keep digging in the source history here to see if I can produce more patches for the 32-bit issues I find working forward from last Dec or so. It might be a week or two before I have anything else to report.

Joe_Groff · September 25, 2018, 4:49pm

Swift requires PIC, and you must use gold or lld to link. The build system ought to be using gold by default, so there may be a bug there that it's even trying GNU ld. Those relocations are used in Swift's runtime metadata so that it can be mapped read-only without load-time cost while still being able to refer to metadata objects in other images through the GOT. You could ping Lang Hames on the llvm-dev mailing list about maybe adding support for the missing relocations to LLVM's ExecutionEngine. Alternatively, maybe there's a way to get swiftpm to exec a normal binary instead of trying to run code in LLVM's JIT? (cc @Aciid)

It looks like Float80 is indeed hardcoded to be 16-byte aligned:

github.com

apple/swift/blob/main/lib/IRGen/GenType.cpp#L1560


      
  return Types.getTypeMetadataPtrTypeInfo();
}

const TypeInfo &TypeConverter::getTypeMetadataPtrTypeInfo() {
  if (TypeMetadataPtrTI) return *TypeMetadataPtrTI;
  TypeMetadataPtrTI = createUnmanagedStorageType(IGM.TypeMetadataPtrTy,
                                                 ReferenceCounting::Unknown,
                                                 /*isOptional*/false);
  TypeMetadataPtrTI->NextConverted = FirstType;
  FirstType = TypeMetadataPtrTI;
  return *TypeMetadataPtrTI;
}

const TypeInfo &IRGenModule::getSwiftContextPtrTypeInfo() {
  return Types.getSwiftContextPtrTypeInfo();
}

const TypeInfo &TypeConverter::getSwiftContextPtrTypeInfo() {
  if (SwiftContextPtrTI) return *SwiftContextPtrTI;
  SwiftContextPtrTI = createUnmanagedStorageType(IGM.SwiftContextPtrTy,
                                                 ReferenceCounting::Unknown,

I don't think that's correct even on 32-bit i386 Darwin, but we probably never noticed because Swift never supported i386 Darwin. The right thing to do would be to ask the LLVM DataLayout for the platform's size and alignment for the LLVM x86_fp80 type. You may need to add a new builtin type metadata record for types with 12-byte size and 16-byte alignment to the runtime too.

Kaiede · September 25, 2018, 5:00pm

Indeed, I already have the patch for that behavior.

According to Clang/LLVM, Darwin i386 was also 16/16. Linux i386 was 12 byte size, 4 byte aligned. For now, the patch is just asking LLVM itself to tell Swift what the Size/Alignment should be for IEEE80.
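The comparison above can be sketched in plain C++ by asking the host compiler instead of hardcoding numbers. This is only an illustration, not the actual patch: it assumes that `long double` on the target is the x87 80-bit type (the case for GCC and Clang on x86 Linux, but not on ARM or MSVC), and `Layout`, `layoutOf`, `hardcodedFloat80Layout`, `sameLayout` and `hardcodedMatchesPlatform` are hypothetical names, not Swift or LLVM APIs.

```cpp
#include <cstddef>

// Size and alignment of a type as reported by the compiler for this target.
struct Layout {
  std::size_t size;
  std::size_t alignment;
};

// Ask the compiler rather than assuming a value.
template <typename T>
constexpr Layout layoutOf() {
  return Layout{sizeof(T), alignof(T)};
}

// Hypothetical stand-in for the hardcoded 16-byte size / 16-byte alignment
// assumption discussed in this thread.
constexpr Layout hardcodedFloat80Layout() {
  return Layout{16, 16};
}

constexpr bool sameLayout(Layout a, Layout b) {
  return a.size == b.size && a.alignment == b.alignment;
}

// True only when the hardcoded assumption agrees with what the platform
// reports for long double.
constexpr bool hardcodedMatchesPlatform() {
  return sameLayout(hardcodedFloat80Layout(), layoutOf<long double>());
}
```

Going by the numbers quoted above, `hardcodedMatchesPlatform()` would be expected to hold on a 16/16 target and to fail on 32-bit x86 Linux, which is the kind of disagreement a layout query avoids.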

The bug is encountered by XCTest, and since there's no SwiftPM at that point, it's hard to tell without digging in if it's just a case of XCTest assuming ld is safe or not. Swift and Foundation are fine, so I suspect it is XCTest. llbuild for example assumes C++ atomic is implemented without a library, while llvm correctly checks for it.

That's somewhat unfortunate to hear about the GOT use, but it makes sense. The easiest approach IMO is to implement support. I think x86 ELF is about the only platform that can't digest it at the moment. GOT32 + GOTOFF are very likely enough to unblock 32-bit x86 Swift, and supporting them is probably beneficial all-up for llvm/swift. The problem is that trying to address SwiftPM at the moment seems like a more complicated beast because of how it is bootstrapping itself. That bootstrap process is what fails.

As I said, it's not blocking at the moment, so I can continue my 32-bit investigations without addressing it in the short term.

Aciid · September 25, 2018, 5:00pm

There is no way to do that currently. We'll have to change how the manifest loading mechanism works which is a non-trivial amount of work.

Joe_Groff · September 25, 2018, 5:03pm

Yeah, in principle, LLVM's ExecutionEngine ought to be able to load any object file that the system linker can, so it's overall beneficial to LLVM to fill in missing relocations. We've had to do this for other platforms sometimes too.

Kaiede · September 25, 2018, 5:15pm

Yup, it's just more that since I have a patch set that builds stdlib + Foundation, it tells me that the compiler + optimizer isn't completely bad and is generally doing the right thing. I'd like to find the regressions that happened after the Dec 3rd snapshot.

Yeah, that's where the error gets hit, loading the manifest. I kinda figured it was something along these lines. Thanks for the confirmation.

Kaiede · September 26, 2018, 5:25pm

I've made some progress and understand at least the history of the main breaks I'm seeing.

In the middle of December, these two commits were added: aa4d53e and 3af569c. These unfortunately broke alignment on 32-bit, because Swift assumes (and asserts) that sizeof(Type) == sizeof(unsigned). They allocated something that should be implicitly 8-byte aligned, but with only 4-byte alignment on 32-bit. This was later fixed by 7c096c9f264075d1f38e2621f3659e25c0b78233 on Jan 30th.
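To illustrate the kind of mismatch described above, here is a minimal C++ sketch. It is not the compiler's code: `Header`, `alignTo`, `isAligned` and `trailingOffsetAssuming` are hypothetical names, and the 4-byte header is an assumption chosen only to make the arithmetic visible.

```cpp
#include <cstddef>
#include <cstdint>

// Round `offset` up to the next multiple of `align`.
// `align` must be a power of two.
constexpr std::size_t alignTo(std::size_t offset, std::size_t align) {
  return (offset + align - 1) & ~(align - 1);
}

// True when `offset` is a multiple of the power-of-two `align`.
constexpr bool isAligned(std::size_t offset, std::size_t align) {
  return (offset & (align - 1)) == 0;
}

// Hypothetical 4-byte header followed in memory by trailing 64-bit fields.
struct Header {
  std::uint32_t kind;
};

// Offset at which the trailing storage starts if the layout code only
// honours `align` bytes of alignment.
constexpr std::size_t trailingOffsetAssuming(std::size_t align) {
  return alignTo(sizeof(Header), align);
}
```

With 4-byte alignment the trailing storage starts at offset 4, which `isAligned(4, 8)` rejects for a 64-bit field. With 8-byte alignment it starts at offset 8. On a 64-bit host the pointer-sized alignment is already 8, which would explain why a break of this shape only shows up on 32-bit targets.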

However, a second break was introduced before the Jan 30 fix but never seen, because the first break would be hit first while building the Swift stdlib, aborting the build before the second break could show up.

This second break is the hasVal assert during RedundantLoadElimination. The following commits are what git bisect found:

5cf9fd741459fd9e29df03702033a7f8d2fa1240: Remove _StringBuffer
b360bd6d69818feaf75516a8b2d656ca053139ee: Remove _LegacyStringCore
f2a96496a0a44a86bf5b35f5fad9023ae4ac7add: [StringGuts] Support for 32-bit platforms
3be2faf5d320000b2fae7049dc2400eb1dd4dcbf: [String] Initial implementation of 64-bit StringGuts.
90e894729a0791ce2c78d18607e9c4c8f3667d9d: [StringGuts] Linux support
6d1866f8461a6646190445d65bebae9d223252cd: [StringGuts] Clean-up in preparation for merge.

This is the set of changes that introduced StringGuts. The bisect couldn't get any more detailed than this partly because the build was broken for other reasons while the work was in progress. I suspect that the 32-bit support didn't quite fix 32-bit completely, and the other break hiding this would have made it look like no new break was introduced, even if you were looking closely at this while trying to avoid root causing that other break.

Unfortunately, this is a rather big, and rather core change to how strings work, so it isn't immediately obvious. It's possible there's a forward fix here, but that fix likely didn't make it in the 4.2 branch. I'll start digging in to see if I can make sense of it on my own, but pointers here would be helpful.

cc @milseman (does this work?)

jrose · September 26, 2018, 5:26pm

(@Michael_Ilseman on this site)

Michael_Ilseman · September 26, 2018, 5:42pm

@Kaiede I can help answer any questions about String's ABI, and @lorentey is also familiar with the 32-bit version. On 32-bit, it's just an enum and a UInt. Are you saying that there are assumptions in the compiler or runtime that all such 32-bit Strings are 16 byte aligned?

Kaiede · September 26, 2018, 5:52pm

@Michael_Ilseman / @lorentey,

I don't know if it's because of 16 byte alignment specifically, but because AArch64 doesn't have this issue, and 32-bit x86 and Arm32 do, it's a possible suspect in my mind. It's pretty clearly some sort of 32/64-bit mismatch somewhere in here. And given that the failure is happening with _UnmanagedString<UInt16> (even if I just look at the Jan 25 snapshot or the merge commit rather than the 4.2 release tag), I am inclined to believe git bisect is working correctly here.

The problem is that the failure itself is far enough removed from the root that debugging through the stack where the assert is fired hasn't been terribly helpful so far. I can take another crack at it now that I'm more familiar with the compiler's behavior after debugging the Float80 behavior on 32-bit x86. It might turn up some sort of lead that might be helpful, but I was mostly curious if you were already aware of perhaps a forward fix in this space for 32-bit?

Michael_Ilseman · September 26, 2018, 6:47pm

String on x86-32 passes iOS-simulator testing and String on armv7 passes watchOS testing at the time of merge and has since, so I'm unaware of any kind of forward fix. The common denominator is that both are Darwin platforms, which validates your intuition that it might be alignment assumptions (similar to FP80).

I'm indifferent regarding whether we should fix the alignment assumptions, or just force String/others to be 16-byte aligned. @Joe_Groff probably has a better handle on which is the right fix.