I have a (unfortunately very large still) chunk of C++ code which clang++ never appears to finish compiling. The following three criteria need to met in order for this to happen: * I need to use clang 3.8.0rc1 (the issue does not occur with 3.7.1) * I need to pass -mavx (or something that implies it, like -march=sandybridge) * I need to pass -O2 (the issue does not occur with -O1) I've created a preprocessor dump using clang 3.7.1 and -save-temps that I'm attaching to this report. As expected, it shows the following behaviour: # clang++ 3.7.1 didn't have the problem % ( ulimit -t 10; time clang++3.7.1 -c -std=c++14 -o /dev/null uggridgeometry.ii -O2 -march=sandybridge ) 1.26s user 0.02s system 99% cpu 1.287 total # clang++ 3.8.0rc1 has the problem % ( ulimit -t 10; time clang++3.8.0rc1 -c -std=c++14 -o /dev/null uggridgeometry.ii -O2 -march=sandybridge ) #0 0x0000000001c048e5 llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1c048e5) #1 0x0000000001c028a6 llvm::sys::RunSignalHandlers() (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1c028a6) #2 0x0000000001c02ac4 SignalHandler(int) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1c02ac4) #3 0x00007efdc42e08d0 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0xf8d0) #4 0x000000000147cebc llvm::FindAvailableLoadedValue(llvm::Value*, llvm::BasicBlock*, llvm::ilist_iterator<llvm::Instruction>&, unsigned int, llvm::AAResults*, llvm::AAMDNodes*) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x147cebc) #5 0x00000000019a1b5a llvm::InstCombiner::visitLoadInst(llvm::LoadInst&) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x19a1b5a) #6 0x0000000001965213 llvm::InstCombiner::run() (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1965213) #7 0x0000000001966309 combineInstructionsOverFunction(llvm::Function&, llvm::InstCombineWorklist&, llvm::AAResults*, llvm::AssumptionCache&, llvm::TargetLibraryInfo&, llvm::DominatorTree&, llvm::LoopInfo*) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1966309) #8 0x0000000001966c70 (anonymous namespace)::InstructionCombiningPass::runOnFunction(llvm::Function&) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1966c70) #9 0x00000000018c7083 llvm::FPPassManager::runOnFunction(llvm::Function&) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x18c7083) #10 0x00000000018c76cb llvm::legacy::PassManagerImpl::run(llvm::Module&) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x18c76cb) #11 0x0000000001d1d0d2 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::raw_pwrite_stream*) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1d1d0d2) #12 0x0000000002271917 clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x2271917) #13 0x000000000255755d clang::ParseAST(clang::Sema&, bool, bool) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x255755d) #14 0x00000000022719fb clang::CodeGenAction::ExecuteAction() (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x22719fb) #15 0x0000000001fd9b26 clang::FrontendAction::Execute() (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1fd9b26) #16 0x0000000001fb3216 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1fb3216) #17 0x00000000020601b3 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x20601b3) #18 0x0000000000a985c8 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0xa985c8) #19 0x0000000000a57c82 main (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0xa57c82) #20 0x00007efdc350ab45 __libc_start_main /build/glibc-3Vu5mt/glibc-2.19/csu/libc-start.c:321:0 #21 0x0000000000a947c4 _start (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0xa947c4) Stack dump: 0. Program arguments: /home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8 -cc1 -triple x86_64-unknown-linux-gnu -emit-obj -disable-free -main-file-name uggridgeometry.ii -mrelocation-model static -mthread-model posix -fmath-errno -masm-verbose -mconstructor-aliases -munwind-tables -fuse-init-array -target-cpu sandybridge -momit-leaf-frame-pointer -dwarf-column-info -debugger-tuning=gdb -coverage-file /dev/null -resource-dir /home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/../lib/clang/3.8.0 -O2 -std=c++14 -fdeprecated-macro -fdebug-compilation-dir /home/mi/pipping/dune/build-Release/dune-grid/lib -ferror-limit 19 -fmessage-length 175 -fobjc-runtime=gcc -fcxx-exceptions -fexceptions -fdiagnostics-show-option -fcolor-diagnostics -vectorize-loops -vectorize-slp -o /dev/null -x c++-cpp-output uggridgeometry.ii 1. <eof> parser at end of file 2. Per-module optimization passes 3. Running pass 'Function Pass Manager' on module 'uggridgeometry.ii'. 4. Running pass 'Combine redundant instructions' on function '@_ZN4Dune5UG_NSILi3EE14TransformationEiPPdRKNS_11FieldVectorIdLi3EEERNS_11FieldMatrixIdLi3ELi3EEE' clang-3.8: error: unable to execute command: CPU time limit exceeded clang-3.8: error: clang frontend command failed due to signal (use -v to see invocation) clang version 3.8.0 (tags/RELEASE_380/rc1) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /home/mi/pipping/dune/inst/clang-3.8.0rc1/bin clang-3.8: note: diagnostic msg: PLEASE submit a bug report to http://llvm.org/bugs/ and include the crash backtrace, preprocessed source, and associated run script. clang-3.8: note: diagnostic msg: Error generating preprocessed source(s) - no preprocessable inputs. 11.12s user 0.06s system 98% cpu 11.348 total # The problem goes away with -O1 % ( ulimit -t 10; time clang++3.8.0rc1 -c -std=c++14 -o /dev/null uggridgeometry.ii -O1 -march=sandybridge ) 1.30s user 0.04s system 99% cpu 1.340 total # The problem goes away with an architecture that does not support AVX % ( ulimit -t 10; time clang++3.8.0rc1 -c -std=c++14 -o /dev/null uggridgeometry.ii -O2 -march=westmere ) 1.29s user 0.04s system 99% cpu 1.340 total # The problem is triggered by the addition of AVX % ( ulimit -t 10; time clang++3.8.0rc1 -c -std=c++14 -o /dev/null uggridgeometry.ii -O2 -march=westmere -mavx ) < killed after 10s, same output a above > # The problem is triggered by AVX alone % ( ulimit -t 10; time clang++3.8.0rc1 -c -std=c++14 -o /dev/null uggridgeometry.ii -O2 -mavx ) < killed after 10s, same output a above >
Created attachment 15739 [details] bzip2-compressed preprocessor dump
+David, looks like it has instcombine on the stack.
reduced IR: target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-unknown-linux-gnu" define i1 @f(double* %tmp, i1 %B) align 2 { entry: %tmp1 = bitcast double* %tmp to <2 x double>* %tmp2 = load <2 x double>, <2 x double>* %tmp1, align 8 %tmp3 = extractelement <2 x double> %tmp2, i32 0 %sub12 = fsub double undef, %tmp3 %tmp5 = extractelement <2 x double> %tmp2, i32 1 br i1 %B, label %if.else, label %if.end if.else: ; preds = %entry %tmp6 = insertelement <4 x double> zeroinitializer, double %tmp5, i32 3 br label %if.end if.end: ; preds = %if.else, %entry %phi = phi <4 x double> [ undef, %entry ], [ %tmp6, %if.else ] %tmp7 = fadd <4 x double> undef, %phi %tmp9 = extractelement <4 x double> %tmp7, i32 1 %mul37 = fmul double %sub12, %tmp9 %mul38 = fmul double %mul37, undef %tobool39 = fcmp une double %mul38, 0.000000e+00 ret i1 %tobool39 }
Bisected to r256394: Author: Sanjay Patel <spatel@rotateright.com> Date: Thu Dec 24 21:17:56 2015 +0000 [InstCombine] transform more extract/insert pairs into shuffles (PR2109) This is an extension of the shuffle combining from r203229: http://reviews.llvm.org/rL203229 The idea is to widen a short input vector with undef elements so the existing shuffle transform for extract/insert can kick in. The motivation is to finally solve PR2109: https://llvm.org/bugs/show_bug.cgi?id=2109 For that example, the IR becomes: %1 = bitcast <2 x i32>* %P to <2 x float>* %ld1 = load <2 x float>, <2 x float>* %1, align 8 %2 = shufflevector <2 x float> %ld1, <2 x float> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> %i2 = shufflevector <4 x float> %A, <4 x float> %2, <4 x i32> <i32 0, i32 1, i32 4, i32 5> ret <4 x float> %i2 And x86 SSE output improves from: movq (%rdi), %xmm1 ## xmm1 = mem[0],zero movdqa %xmm1, %xmm2 shufps $229, %xmm2, %xmm2 ## xmm2 = xmm2[1,1,2,3] shufps $48, %xmm0, %xmm1 ## xmm1 = xmm1[0,0],xmm0[3,0] shufps $132, %xmm1, %xmm0 ## xmm0 = xmm0[0,1],xmm1[0,2] shufps $32, %xmm0, %xmm2 ## xmm2 = xmm2[0,0],xmm0[2,0] shufps $36, %xmm2, %xmm0 ## xmm0 = xmm0[0,1],xmm2[2,0] retq To the almost optimal: movhpd (%rdi), %xmm0 Note: There's a tension in the existing transform related to generating arbitrary shufflevector masks. We avoid that in other places in InstCombine because we're scared that codegen can't handle strange masks, but it looks like we're ok with producing those here. I purposely chose weird insert/extract indexes for the regression tests to see the effect in these cases. For PowerPC+Altivec, AArch64, and X86+SSE/AVX, I think the codegen is equal or better for these examples. Sanjay, can you please take a look?
(In reply to comment #4) > Sanjay, can you please take a look? Yep, checking it out now. Thanks for the reduction!
I've checked in a patch here: http://reviews.llvm.org/rL259236 The change should be strictly limiting the offending transform, so I think it should be safe to accept to the 3.8 branch (assuming there's no bot fallout).
This fixes the problem for me. :) Thanks a lot to David Majnemer for the reduction! Thanks a lot to Sanjay Patel for the fix!