26354 – regression: Compilation hangs with -O2 -mavx for certain input (valid code)

Status:	RESOLVED FIXED

Alias:	None

Product:	clang
Classification:	Unclassified
Component:	C++14
Version:	3.8
Hardware:	PC All

Importance:	P release blocker
Assignee:	Unassigned Clang Bugs

URL:
Keywords:	regression

Depends on:
Blocks:	26059

Reported:	2016-01-28 06:20 PST by Elias Pipping
Modified:	2016-01-30 06:11 PST
CC List:	4 users

See Also:
Fixed By Commit(s):

Attachments
bzip2-compressed preprocessor dump (242.16 KB, application/x-bzip2) 2016-01-28 06:22 PST, Elias Pipping

Description Elias Pipping 2016-01-28 06:20:59 PST

I have an (unfortunately still very large) chunk of C++ code that clang++ never appears to finish compiling.

The following three criteria need to be met for this to happen:
 * I need to use clang 3.8.0rc1 (the issue does not occur with 3.7.1)
 * I need to pass -mavx (or something that implies it, like -march=sandybridge)
 * I need to pass -O2 (the issue does not occur with -O1)

I've created a preprocessor dump using clang 3.7.1 and -save-temps that I'm attaching to this report. As expected, it shows the following behaviour:

# clang++ 3.7.1 didn't have the problem
% ( ulimit -t 10; time clang++3.7.1 -c -std=c++14 -o /dev/null uggridgeometry.ii -O2 -march=sandybridge )
1.26s user 0.02s system 99% cpu 1.287 total

# clang++ 3.8.0rc1 has the problem
% ( ulimit -t 10; time clang++3.8.0rc1 -c -std=c++14 -o /dev/null uggridgeometry.ii -O2 -march=sandybridge )
#0 0x0000000001c048e5 llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1c048e5)
#1 0x0000000001c028a6 llvm::sys::RunSignalHandlers() (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1c028a6)
#2 0x0000000001c02ac4 SignalHandler(int) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1c02ac4)
#3 0x00007efdc42e08d0 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0xf8d0)
#4 0x000000000147cebc llvm::FindAvailableLoadedValue(llvm::Value*, llvm::BasicBlock*, llvm::ilist_iterator<llvm::Instruction>&, unsigned int, llvm::AAResults*, llvm::AAMDNodes*) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x147cebc)
#5 0x00000000019a1b5a llvm::InstCombiner::visitLoadInst(llvm::LoadInst&) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x19a1b5a)
#6 0x0000000001965213 llvm::InstCombiner::run() (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1965213)
#7 0x0000000001966309 combineInstructionsOverFunction(llvm::Function&, llvm::InstCombineWorklist&, llvm::AAResults*, llvm::AssumptionCache&, llvm::TargetLibraryInfo&, llvm::DominatorTree&, llvm::LoopInfo*) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1966309)
#8 0x0000000001966c70 (anonymous namespace)::InstructionCombiningPass::runOnFunction(llvm::Function&) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1966c70)
#9 0x00000000018c7083 llvm::FPPassManager::runOnFunction(llvm::Function&) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x18c7083)
#10 0x00000000018c76cb llvm::legacy::PassManagerImpl::run(llvm::Module&) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x18c76cb)
#11 0x0000000001d1d0d2 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::raw_pwrite_stream*) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1d1d0d2)
#12 0x0000000002271917 clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x2271917)
#13 0x000000000255755d clang::ParseAST(clang::Sema&, bool, bool) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x255755d)
#14 0x00000000022719fb clang::CodeGenAction::ExecuteAction() (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x22719fb)
#15 0x0000000001fd9b26 clang::FrontendAction::Execute() (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1fd9b26)
#16 0x0000000001fb3216 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x1fb3216)
#17 0x00000000020601b3 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0x20601b3)
#18 0x0000000000a985c8 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0xa985c8)
#19 0x0000000000a57c82 main (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0xa57c82)
#20 0x00007efdc350ab45 __libc_start_main /build/glibc-3Vu5mt/glibc-2.19/csu/libc-start.c:321:0
#21 0x0000000000a947c4 _start (/home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8+0xa947c4)
Stack dump:
0.	Program arguments: /home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/clang-3.8 -cc1 -triple x86_64-unknown-linux-gnu -emit-obj -disable-free -main-file-name uggridgeometry.ii -mrelocation-model static -mthread-model posix -fmath-errno -masm-verbose -mconstructor-aliases -munwind-tables -fuse-init-array -target-cpu sandybridge -momit-leaf-frame-pointer -dwarf-column-info -debugger-tuning=gdb -coverage-file /dev/null -resource-dir /home/mi/pipping/dune/inst/clang-3.8.0rc1/bin/../lib/clang/3.8.0 -O2 -std=c++14 -fdeprecated-macro -fdebug-compilation-dir /home/mi/pipping/dune/build-Release/dune-grid/lib -ferror-limit 19 -fmessage-length 175 -fobjc-runtime=gcc -fcxx-exceptions -fexceptions -fdiagnostics-show-option -fcolor-diagnostics -vectorize-loops -vectorize-slp -o /dev/null -x c++-cpp-output uggridgeometry.ii 
1.	<eof> parser at end of file
2.	Per-module optimization passes
3.	Running pass 'Function Pass Manager' on module 'uggridgeometry.ii'.
4.	Running pass 'Combine redundant instructions' on function '@_ZN4Dune5UG_NSILi3EE14TransformationEiPPdRKNS_11FieldVectorIdLi3EEERNS_11FieldMatrixIdLi3ELi3EEE'
clang-3.8: error: unable to execute command: CPU time limit exceeded
clang-3.8: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 3.8.0 (tags/RELEASE_380/rc1)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/mi/pipping/dune/inst/clang-3.8.0rc1/bin
clang-3.8: note: diagnostic msg: PLEASE submit a bug report to http://llvm.org/bugs/ and include the crash backtrace, preprocessed source, and associated run script.
clang-3.8: note: diagnostic msg: Error generating preprocessed source(s) - no preprocessable inputs.
11.12s user 0.06s system 98% cpu 11.348 total

# The problem goes away with -O1
% ( ulimit -t 10; time clang++3.8.0rc1 -c -std=c++14 -o /dev/null uggridgeometry.ii -O1 -march=sandybridge )
1.30s user 0.04s system 99% cpu 1.340 total

# The problem goes away with an architecture that does not support AVX
% ( ulimit -t 10; time clang++3.8.0rc1 -c -std=c++14 -o /dev/null uggridgeometry.ii -O2 -march=westmere )
1.29s user 0.04s system 99% cpu 1.340 total

# The problem is triggered by the addition of AVX
% ( ulimit -t 10; time clang++3.8.0rc1 -c -std=c++14 -o /dev/null uggridgeometry.ii -O2 -march=westmere -mavx )
< killed after 10s, same output as above >

# The problem is triggered by AVX alone
% ( ulimit -t 10; time clang++3.8.0rc1 -c -std=c++14 -o /dev/null uggridgeometry.ii -O2 -mavx )
< killed after 10s, same output as above >
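
The timing checks above all follow one pattern: run the compiler in a subshell under a CPU-time limit and see whether it finishes. A minimal sketch of that pattern as a reusable helper might look like this; `run_limited` is a hypothetical name that does not appear in the original report, and it only reports the exit status without trying to interpret it.

```shell
# Hypothetical helper (not part of the original report):
# run a command under a CPU-time limit given in seconds and
# print a one-line verdict based on its exit status.
run_limited() {
    limit=$1
    shift
    # The subshell keeps the ulimit from affecting the calling shell.
    # The command only runs if the limit could actually be set.
    ( ulimit -t "$limit" && "$@" ) >/dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "ok"
    else
        echo "failed or killed (exit status $rc)"
    fi
    return "$rc"
}
```

For example, `run_limited 10 clang++ -c -std=c++14 -o /dev/null uggridgeometry.ii -O2 -mavx` would be expected to report a failure with an affected compiler and `ok` with an unaffected one.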

Comment 1 Elias Pipping 2016-01-28 06:22:00 PST

Created attachment 15739
bzip2-compressed preprocessor dump

Comment 2 Hans Wennborg 2016-01-28 11:07:30 PST

+David, looks like it has instcombine on the stack.

Comment 3 David Majnemer 2016-01-28 13:15:35 PST

reduced IR:
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define i1 @f(double* %tmp, i1 %B) align 2 {
entry:
  %tmp1 = bitcast double* %tmp to <2 x double>*
  %tmp2 = load <2 x double>, <2 x double>* %tmp1, align 8
  %tmp3 = extractelement <2 x double> %tmp2, i32 0
  %sub12 = fsub double undef, %tmp3
  %tmp5 = extractelement <2 x double> %tmp2, i32 1
  br i1 %B, label %if.else, label %if.end

if.else:                                          ; preds = %entry
  %tmp6 = insertelement <4 x double> zeroinitializer, double %tmp5, i32 3
  br label %if.end

if.end:                                           ; preds = %if.else, %entry
  %phi = phi <4 x double> [ undef, %entry ], [ %tmp6, %if.else ]
  %tmp7 = fadd <4 x double> undef, %phi
  %tmp9 = extractelement <4 x double> %tmp7, i32 1
  %mul37 = fmul double %sub12, %tmp9
  %mul38 = fmul double %mul37, undef
  %tobool39 = fcmp une double %mul38, 0.000000e+00
  ret i1 %tobool39
}

Comment 4 David Majnemer 2016-01-28 14:41:30 PST

Bisected to r256394:
Author: Sanjay Patel <spatel@rotateright.com>
Date:   Thu Dec 24 21:17:56 2015 +0000

    [InstCombine] transform more extract/insert pairs into shuffles (PR2109)
    
    This is an extension of the shuffle combining from r203229:
    http://reviews.llvm.org/rL203229
    
    The idea is to widen a short input vector with undef elements so the
    existing shuffle transform for extract/insert can kick in.
    
    The motivation is to finally solve PR2109:
    https://llvm.org/bugs/show_bug.cgi?id=2109
    
    For that example, the IR becomes:
    
    %1 = bitcast <2 x i32>* %P to <2 x float>*
    %ld1 = load <2 x float>, <2 x float>* %1, align 8
    %2 = shufflevector <2 x float> %ld1, <2 x float> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
    %i2 = shufflevector <4 x float> %A, <4 x float> %2, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
    ret <4 x float> %i2
    
    And x86 SSE output improves from:
    
    movq	(%rdi), %xmm1           ## xmm1 = mem[0],zero
    movdqa	%xmm1, %xmm2
    shufps	$229, %xmm2, %xmm2      ## xmm2 = xmm2[1,1,2,3]
    shufps	$48, %xmm0, %xmm1       ## xmm1 = xmm1[0,0],xmm0[3,0]
    shufps	$132, %xmm1, %xmm0      ## xmm0 = xmm0[0,1],xmm1[0,2]
    shufps	$32, %xmm0, %xmm2       ## xmm2 = xmm2[0,0],xmm0[2,0]
    shufps	$36, %xmm2, %xmm0       ## xmm0 = xmm0[0,1],xmm2[2,0]
    retq
    
    To the almost optimal:
    
    movhpd	(%rdi), %xmm0
    
    Note: There's a tension in the existing transform related to generating
    arbitrary shufflevector masks. We avoid that in other places in InstCombine
    because we're scared that codegen can't handle strange masks, but it looks
    like we're ok with producing those here. I purposely chose weird insert/extract
    indexes for the regression tests to see the effect in these cases.
    For PowerPC+Altivec, AArch64, and X86+SSE/AVX, I think the codegen is equal or
    better for these examples.

Sanjay, can you please take a look?
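
The comment does not say how the bisection was carried out. As a generic illustration only, a binary search over revision numbers could be sketched as below; `first_bad` and the predicate it calls are hypothetical names, and the sketch assumes the predicate flips from good to bad exactly once inside the range.

```shell
# Hypothetical sketch (not the procedure used in this bug):
#   first_bad GOOD BAD PREDICATE
# GOOD is a revision known to pass, BAD a later revision known to fail.
# PREDICATE is a command taking a revision number; exit 0 means "good".
first_bad() {
    good=$1
    bad=$2
    is_good=$3
    # Invariant: $good passes and $bad fails; halve the gap each round.
    while [ $((bad - good)) -gt 1 ]; do
        mid=$(( (good + bad) / 2 ))
        if "$is_good" "$mid"; then
            good=$mid
        else
            bad=$mid
        fi
    done
    # Once the two are adjacent, $bad is the first failing revision.
    echo "$bad"
}
```

In practice the predicate would be a command that builds the compiler at the given revision and checks whether the reproducer compiles within a time limit; that part is left out here because it depends on the local build setup.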

Comment 5 Sanjay Patel 2016-01-28 14:57:30 PST

(In reply to comment #4) 
> Sanjay, can you please take a look?

Yep, checking it out now. Thanks for the reduction!

Comment 6 Sanjay Patel 2016-01-29 14:28:52 PST

I've checked in a patch here:
http://reviews.llvm.org/rL259236

The change should strictly limit the offending transform, so I think it should be safe to accept into the 3.8 branch (assuming there's no bot fallout).

Comment 7 Elias Pipping 2016-01-30 06:11:23 PST

This fixes the problem for me. :)

Thanks a lot to David Majnemer for the reduction!
Thanks a lot to Sanjay Patel for the fix!