Created attachment 12031 [details]
reduced test case

This is crazy weird. The essential problem is that LICM got slightly more powerful in r200067 and started sinking loads out of the loop body in a corner case. When it does so, we mysteriously stop being able to DSE stores in *a separate function*!

The code for this is heavily reduced from Adobe-C++/loop_unroll.cpp. To reproduce the weird behavior, take an "opt" binary from trunk and an "old opt" from r200066, and compare:

% opt < loop_unroll.reduced.ll -std-link-opts | llc -O3 -o loop_unroll.new.s
% clang++ -lm -o loop_unroll.new loop_unroll.new.s

vs.

% old_opt < loop_unroll.reduced.ll -std-link-opts | llc -O3 -o loop_unroll.old.s
% clang++ -lm -o loop_unroll.old loop_unroll.old.s

When I benchmark these on my sandybridge machine I get:

% perf stat -r5 ./loop_unroll.new && perf stat -r5 ./loop_unroll.old

 Performance counter stats for './loop_unroll.new' (5 runs):

      1376.720033 task-clock                #    0.998 CPUs utilized            ( +-  0.10% )
                1 context-switches          #    0.001 K/sec                    ( +- 31.62% )
                0 cpu-migrations            #    0.000 K/sec                    ( +- 61.24% )
              152 page-faults               #    0.111 K/sec                    ( +-  0.16% )
    5,208,493,602 cycles                    #    3.783 GHz                      ( +-  0.02% )
    2,657,152,524 stalled-cycles-frontend   #   51.02% frontend cycles idle     ( +-  0.04% )
      279,081,881 stalled-cycles-backend    #    5.36% backend cycles idle      ( +-  0.34% )
    8,515,758,351 instructions              #    1.63  insns per cycle
                                            #    0.31  stalled cycles per insn  ( +-  0.00% )
      356,723,894 branches                  #  259.111 M/sec                    ( +-  0.00% )
          201,373 branch-misses             #    0.06% of all branches          ( +-  0.01% )

      1.378844222 seconds time elapsed                                          ( +-  0.10% )

 Performance counter stats for './loop_unroll.old' (5 runs):

       877.683533 task-clock                #    0.998 CPUs utilized            ( +-  0.09% )
                1 context-switches          #    0.001 K/sec                    ( +- 40.82% )
                0 cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
              152 page-faults               #    0.174 K/sec                    ( +-  0.16% )
    3,320,502,190 cycles                    #    3.783 GHz                      ( +-  0.00% )
       11,331,992 stalled-cycles-frontend   #    0.34% frontend cycles idle     ( +-  2.39% )
       35,003,292 stalled-cycles-backend    #    1.05% backend cycles idle      ( +-  1.14% )
    6,978,870,054 instructions              #    2.10  insns per cycle
                                            #    0.01  stalled cycles per insn  ( +-  0.00% )
      356,371,371 branches                  #  406.036 M/sec                    ( +-  0.00% )
          200,790 branch-misses             #    0.06% of all branches          ( +-  0.04% )

      0.879302485 seconds time elapsed                                          ( +-  0.09% )

Note the 50% stalled cycles on the new one!!!

I'm working on getting two A/B inputs to trunk 'opt' that exhibit the behavior, lacking any good ideas about why it's actually happening.

Note that I've checked -- top of tree and r200067 behave exactly the same. The change is only in the patch committed with r200067.
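For anyone trying to picture the transform, here is a minimal hypothetical sketch (not the actual reduced test case) of the kind of load sinking LICM now performs: a load whose only use sits outside the loop gets moved into the exit block, so it executes once rather than on every iteration.

  define i32 @f(i32* %p, i32 %n) {
  entry:
    br label %loop

  loop:
    %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
    %v = load i32* %p                     ; only use of %v is after the loop
    %i.next = add nsw i32 %i, 1
    %cond = icmp slt i32 %i.next, %n
    br i1 %cond, label %loop, label %exit

  exit:                                   ; after sinking, the load lands here,
    ret i32 %v                            ; executing once instead of per iteration
  }

The sinking itself is perfectly sound in isolation; the surprise is that performing it ends up inhibiting DSE in a completely separate function.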
Created attachment 12037 [details]
Narrower test case

Ok, I've reduced the steps to reproduce significantly:

% opt -S < loop_unroll.reduced.ll -debug-pass=Arguments -tbaa -basicaa -globalsmodref-aa -domtree -loops -loop-simplify -licm -memdep -dse | llc -O3 -o loop_unroll.s
% clang -lm -o loop_unroll loop_unroll.s

With the 'opt' binary before r200067, this will successfully eliminate all the stores but the last to %result in _ZN15loop_inner_bodyILi13EiE7do_workERiPKii, but it will fail to do so with top-of-tree. As a consequence, the program built with top-of-tree stalls on 50% of its cycles when run.

The problem appears to be some combination of globalsmodref-aa, licm, and dse. The obvious answer is that globalsmodref-aa becomes significantly weaker in the face of the more aggressive LICM transformation. That much is clear from the analysis, but I've not yet even looked at the implementation of globalsmodref-aa to figure out *why*, or why we can't fix this some other way.
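To make "eliminate all the stores but the last" concrete, the stores DSE should kill have roughly this shape (a hypothetical sketch; the names mirror the report, not the reduced test case):

  define void @do_work(i32* %result) {
  entry:
    store i32 1, i32* %result             ; dead: overwritten before any read
    store i32 2, i32* %result             ; dead: likewise
    store i32 3, i32* %result             ; only the final store must survive
    ret void
  }

In the real test case other memory operations sit between the stores, so DSE (via memdep) can only delete the earlier ones if alias analysis proves none of those operations may read %result -- which appears to be exactly the proof globalsmodref-aa stops providing after the new LICM sinking.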
I have this fully diagnosed and am working on a fix... =[ The root cause is rather horrid.
This should be fixed in r201104. There may be more demons lurking like this, but please file new PRs for them.