Created attachment 12031 [details]
reduced test case

This is crazy weird. The essential problem is that LICM got slightly more powerful in r200067 and started sinking loads out of the loop body in a corner case. When it does so, we mysteriously stop being able to DSE stores in *a separate function*!

The code for this is heavily reduced from Adobe-C++/loop_unroll.cpp. To reproduce the weird behavior, take an "opt" binary from trunk and an "old opt" from r200066, and compare:

% opt < loop_unroll.reduced.ll -std-link-opts | llc -O3 -o loop_unroll.new.s
% clang++ -lm -o loop_unroll.new loop_unroll.new.s

vs.

% old_opt < loop_unroll.reduced.ll -std-link-opts | llc -O3 -o loop_unroll.old.s
% clang++ -lm -o loop_unroll.old loop_unroll.old.s

When I benchmark these on my sandybridge machine I get:

% perf stat -r5 ./loop_unroll.new && perf stat -r5 ./loop_unroll.old

 Performance counter stats for './loop_unroll.new' (5 runs):

      1376.720033 task-clock                #    0.998 CPUs utilized            ( +-  0.10% )
                1 context-switches          #    0.001 K/sec                    ( +- 31.62% )
                0 cpu-migrations            #    0.000 K/sec                    ( +- 61.24% )
              152 page-faults               #    0.111 K/sec                    ( +-  0.16% )
    5,208,493,602 cycles                    #    3.783 GHz                      ( +-  0.02% )
    2,657,152,524 stalled-cycles-frontend   #   51.02% frontend cycles idle     ( +-  0.04% )
      279,081,881 stalled-cycles-backend    #    5.36% backend cycles idle      ( +-  0.34% )
    8,515,758,351 instructions              #    1.63  insns per cycle
                                            #    0.31  stalled cycles per insn  ( +-  0.00% )
      356,723,894 branches                  #  259.111 M/sec                    ( +-  0.00% )
          201,373 branch-misses             #    0.06% of all branches          ( +-  0.01% )

      1.378844222 seconds time elapsed                                          ( +-  0.10% )

 Performance counter stats for './loop_unroll.old' (5 runs):

       877.683533 task-clock                #    0.998 CPUs utilized            ( +-  0.09% )
                1 context-switches          #    0.001 K/sec                    ( +- 40.82% )
                0 cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
              152 page-faults               #    0.174 K/sec                    ( +-  0.16% )
    3,320,502,190 cycles                    #    3.783 GHz                      ( +-  0.00% )
       11,331,992 stalled-cycles-frontend   #    0.34% frontend cycles idle     ( +-  2.39% )
       35,003,292 stalled-cycles-backend    #    1.05% backend cycles idle      ( +-  1.14% )
    6,978,870,054 instructions              #    2.10  insns per cycle
                                            #    0.01  stalled cycles per insn  ( +-  0.00% )
      356,371,371 branches                  #  406.036 M/sec                    ( +-  0.00% )
          200,790 branch-misses             #    0.06% of all branches          ( +-  0.04% )

      0.879302485 seconds time elapsed                                          ( +-  0.09% )

Note the 50% stalled cycles on the new one!!!

I'm working on getting two A/B inputs to trunk 'opt' that exhibit the behavior, lacking any good ideas about why it's actually happening.

Note that I've checked -- top of tree and r200067 behave exactly the same. The change is only in the patch committed with r200067.
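For anyone trying to picture the transform, here is a minimal hypothetical sketch (not the actual reduced test case) of the kind of load sinking LICM now performs: a load whose only use sits outside the loop gets moved into the exit block, so it executes once rather than on every iteration.

  define i32 @f(i32* %p, i32 %n) {
  entry:
    br label %loop

  loop:
    %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
    %v = load i32* %p                     ; only use of %v is after the loop
    %i.next = add nsw i32 %i, 1
    %cond = icmp slt i32 %i.next, %n
    br i1 %cond, label %loop, label %exit

  exit:                                   ; after sinking, the load lands here,
    ret i32 %v                            ; executing once instead of per iteration
  }

The sinking itself is perfectly sound in isolation; the surprise is that performing it ends up inhibiting DSE in a completely separate function.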
Created attachment 12037 [details]
Narrower test case

Ok, I've reduced the steps to reproduce significantly:

% opt -S < loop_unroll.reduced.ll -debug-pass=Arguments -tbaa -basicaa -globalsmodref-aa -domtree -loops -loop-simplify -licm -memdep -dse | llc -O3 -o loop_unroll.s
% clang -lm -o loop_unroll loop_unroll.s

With the 'opt' binary before r200067, this will successfully eliminate all the stores but the last to %result in _ZN15loop_inner_bodyILi13EiE7do_workERiPKii, but it will fail to do so with top-of-tree. As a consequence, the program built with top-of-tree stalls on 50% of its cycles when run.

The problem appears to be some combination of globalsmodref-aa, licm, and dse. The obvious answer is that globalsmodref-aa becomes significantly weaker in the face of the more aggressive LICM transformation. That much is clear from the analysis, but I've not yet even looked at the implementation of globalsmodref-aa to figure out *why*, or why we can't fix this some other way.
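To make "eliminate all the stores but the last" concrete, the stores DSE should kill have roughly this shape (a hypothetical sketch; the names mirror the report, not the reduced test case):

  define void @do_work(i32* %result) {
  entry:
    store i32 1, i32* %result             ; dead: overwritten before any read
    store i32 2, i32* %result             ; dead: likewise
    store i32 3, i32* %result             ; only the final store must survive
    ret void
  }

In the real test case other memory operations sit between the stores, so DSE (via memdep) can only delete the earlier ones if alias analysis proves none of those operations may read %result -- which appears to be exactly the proof globalsmodref-aa stops providing after the new LICM sinking.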
I have this fully diagnosed and am working on a fix... =[ The root cause is rather horrid.
This should be fixed in r201104. There may be more demons lurking like this, but please file new PRs for them.