LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit new LLVM bugs.

Bug 18773 - LICM sinking a single unused load out of a loop prevents DSE in some other function!?!
Summary: LICM sinking a single unused load out of a loop prevents DSE in some other fu...
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Loop Optimizer
Version: trunk
Hardware: PC Linux
Importance: P normal
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-02-08 04:44 PST by Chandler Carruth
Modified: 2014-02-10 16:39 PST
CC: 4 users

See Also:
Fixed By Commit(s):


Attachments
reduced test case (42.47 KB, application/octet-stream)
2014-02-08 04:44 PST, Chandler Carruth
Details
Narrower test case (21.05 KB, application/octet-stream)
2014-02-10 04:40 PST, Chandler Carruth
Details

Description Chandler Carruth 2014-02-08 04:44:04 PST
Created attachment 12031 [details]
reduced test case

This is crazy weird. The essential problem is that LICM got slightly more powerful in r200067 and started sinking loads out of the loop body in a corner case. When it does so, we mysteriously stop being able to DSE stores in *a separate function*!

The code for this is heavily reduced from Adobe-C++/loop_unroll.cpp. To reproduce the weird behavior, take an "opt" binary from trunk and an "old_opt" built from r200066, and compare:

% opt < loop_unroll.reduced.ll -std-link-opts | llc -O3 -o loop_unroll.new.s
% clang++ -lm -o loop_unroll.new loop_unroll.new.s

vs.

% old_opt < loop_unroll.reduced.ll -std-link-opts | llc -O3 -o loop_unroll.old.s
% clang++ -lm -o loop_unroll.old loop_unroll.old.s


When I benchmark these on my sandybridge machine I get:

% perf stat -r5 ./loop_unroll.new && perf stat -r5 ./loop_unroll.old

 Performance counter stats for './loop_unroll.new' (5 runs):

       1376.720033 task-clock                #    0.998 CPUs utilized            ( +-  0.10% )
                 1 context-switches          #    0.001 K/sec                    ( +- 31.62% )
                 0 cpu-migrations            #    0.000 K/sec                    ( +- 61.24% )
               152 page-faults               #    0.111 K/sec                    ( +-  0.16% )
     5,208,493,602 cycles                    #    3.783 GHz                      ( +-  0.02% )
     2,657,152,524 stalled-cycles-frontend   #   51.02% frontend cycles idle     ( +-  0.04% )
       279,081,881 stalled-cycles-backend    #    5.36% backend  cycles idle     ( +-  0.34% )
     8,515,758,351 instructions              #    1.63  insns per cycle        
                                             #    0.31  stalled cycles per insn  ( +-  0.00% )
       356,723,894 branches                  #  259.111 M/sec                    ( +-  0.00% )
           201,373 branch-misses             #    0.06% of all branches          ( +-  0.01% )

       1.378844222 seconds time elapsed                                          ( +-  0.10% )


 Performance counter stats for './loop_unroll.old' (5 runs):

        877.683533 task-clock                #    0.998 CPUs utilized            ( +-  0.09% )
                 1 context-switches          #    0.001 K/sec                    ( +- 40.82% )
                 0 cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
               152 page-faults               #    0.174 K/sec                    ( +-  0.16% )
     3,320,502,190 cycles                    #    3.783 GHz                      ( +-  0.00% )
        11,331,992 stalled-cycles-frontend   #    0.34% frontend cycles idle     ( +-  2.39% )
        35,003,292 stalled-cycles-backend    #    1.05% backend  cycles idle     ( +-  1.14% )
     6,978,870,054 instructions              #    2.10  insns per cycle        
                                             #    0.01  stalled cycles per insn  ( +-  0.00% )
       356,371,371 branches                  #  406.036 M/sec                    ( +-  0.00% )
           200,790 branch-misses             #    0.06% of all branches          ( +-  0.04% )

       0.879302485 seconds time elapsed                                          ( +-  0.09% )


Note the 50% stalled cycles on the new one!!!


I'm working on getting two A/B inputs to trunk 'opt' that exhibit the behavior, lacking any good ideas about why it's actually happening.


Note that I've checked -- top of tree and r200067 behave exactly the same. The behavior change comes only from the patch committed as r200067.
Comment 1 Chandler Carruth 2014-02-10 04:40:51 PST
Created attachment 12037 [details]
Narrower test case

Ok, I've reduced the steps to reproduce significantly:

% opt -S < loop_unroll.reduced.ll -debug-pass=Arguments -tbaa -basicaa -globalsmodref-aa -domtree -loops -loop-simplify -licm -memdep -dse | llc -O3 -o loop_unroll.s
% clang -lm -o loop_unroll loop_unroll.s

With the 'opt' binary before r200067 this will successfully eliminate all the stores but the last to %result in _ZN15loop_inner_bodyILi13EiE7do_workERiPKii, but it fails to do so with top-of-tree. As a consequence, the resulting program stalls for 50% of its cycles when built with top-of-tree.


The problem appears to be some combination of globalsmodref-aa, licm, and dse. The obvious answer is that globalsmodref-aa becomes significantly weaker with the more aggressive LICM transformation. That much is clear from the analysis, but I've not yet even looked at the implementation of globalsmodref-aa to figure out *why* or why we can't fix this some other way.
Comment 2 Chandler Carruth 2014-02-10 12:05:03 PST
I have this diagnosed fully and am working on a fix... =[ The root cause is rather horrid.
Comment 3 Chandler Carruth 2014-02-10 16:39:27 PST
This should be fixed in r201104. There may be more demons lurking like this, but file new PRs for them.