40961 – loop vectorizer unable to reason with trip count bounds

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 40961 - loop vectorizer unable to reason with trip count bounds

Summary: loop vectorizer unable to reason with trip count bounds

Status:	NEW

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Loop Optimizer (show other bugs)
Version:	trunk
Hardware:	Other All

Importance:	P enhancement
Assignee:	Unassigned LLVM Bugs

URL:
Keywords:

Depends on:
Blocks:

Reported:	2019-03-04 14:18 PST by Kalev Soikonen
Modified:	2021-11-10 07:56 PST (History)
CC List:	8 users (show)

See Also:	47247 52464
Fixed By Commit(s):

Attachments
experimental fix (7.11 KB, patch) 2019-09-03 07:03 PDT, Vivek Pandya	Details
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Kalev Soikonen 2019-03-04 14:18:31 PST

I see bugs 38280, 37423, but this one is simpler still. Consider the following:

    void topup(int a[], unsigned long i)
    {
        for (; i < 16; i++) {
            a[i] = 1;
        }
    }

The result, compiled with -O -march=haswell

- loop is vectorized despite the small absolute limit on trip count
- vectorized part has loop step of 256 (!) and is actually unreachable

Comment 1 Florian Hahn 2019-03-04 14:39:42 PST

What's happening depends on the target: for x86_64-apple-macos we generate a check if I < 16 and a call to memset_pattern16, for x86_64-unknown-linux we generate a big vectorized version as reported.

I guess the problem is the unknown start value of the induction variable.

Comment 2 Hideki Saito 2019-03-04 15:57:42 PST

This is coming from LoopUtil's getLoopEstimatedTripCount returning "None".

So, some part of the compiler (such as apple-macos calling memset_pattern16) knows trip count of up to 16 is special and other parts (like LV) does not.

I think the notion of estimated trip count information needs to be extended beyond single value.

What can be described by Intel's loopcount pragma may be a good starting point for the discussion: It can represent min/max/ave as well as individual known constant values. 

Here, we have 16 or less (no wrap case) ---- or possibly big number (wrap case). I could see why we'd want to distinguish this from i=0;i<N;i++ where N can be an arbitrary big number.

Comment 3 Kalev Soikonen 2019-03-04 16:29:45 PST

With simple loops, but where the exact trip count is unknown, it should still be possible to perform an analysis on the initial loop condition: value range of either side of the (a < b) relation can restrict the full range of the type, yielding upper, and lower, limits to the trip count.

I don't know how easy it is to get at...

Upper limit may be useful as a heuristic: loop step of approx. sqrt(limit) to minimize the overall branch count.

Comment 4 Hideki Saito 2019-03-04 17:13:09 PST

What I'm saying is that this is most likely the issue that can affect other loop optimizations as well. So, any solution we come up with should not be just vectorizer only. We better have some discussions with many loop optimization stakeholders ---- in llvm-dev. Would you like me to start a thread?

Comment 5 Eli Friedman 2019-03-04 19:09:19 PST

I don't think there's any fundamnental issue to bring up on llvmdev.  The vectorizer tries to handle this sort of construct by calling ScalarEvolution::getSmallConstantMaxTripCount(); it just isn't succeeding here for some reason.

Comment 6 Hideki Saito 2019-03-05 11:03:50 PST

Sorry, you are right. I was thinking "i != 16" case.

Comment 7 Vivek Pandya 2019-09-01 23:06:23 PDT

I am trying to find out why ScalarEvolution is not able to give correct back edge taken count for an expression.

So in this case flow reaches to howFarToZero() and in that function, I have following expressions as SCEV

Start = (15 + (-1 * %i) (which is set to Distance SCEV)
Step = 1

now, first of all, should I expect Start as ConstantSCEV (15) instead of the whole expression
the problem here is getUnsignedRangeMax(Distance) returns very large number because of -1 in the SCEV.

How we can make this work? Here we can clearly say that after 15 steps this expression will be 0
and thus we have a value for backedge taken count.

Comment 8 Vivek Pandya 2019-09-03 07:03:58 PDT

Created attachment 22464 [details]
experimental fix

Please look at the proposed fix and if this seems in the correct direction I can send it for review.

Comment 9 Eli Friedman 2019-09-03 13:30:03 PDT

That patch is taking the wrong approach.

The IR should look something like this:

define void @topup(i32* nocapture %a, i64 %i) {
entry:
  %cmp3 = icmp ult i64 %i, 16
  br i1 %cmp3, label %for.body, label %for.end

for.body:
  %i.addr.04 = phi i64 [ %inc, %for.body ], [ %i, %entry ]
  %arrayidx = getelementptr inbounds i32, i32* %a, i64 %i.addr.04
  store i32 1, i32* %arrayidx, align 4, !tbaa !2
  %inc = add nuw nsw i64 %i.addr.04, 1
  %exitcond = icmp eq i64 %inc, 16
  br i1 %exitcond, label %for.end, label %for.body

for.end:
  ret void
}


If you look at the IR, you have a loop which is roughly of the form "for (; i != 16; ++i)".  In general, for loops of that form, the backedge-taken count is unrestricted: unsigned arithmetic wraps.  (For example, if i is -10, the loop runs 26 times.)

There are two factors you could use to prove the max backedge-taken count.  One, the induction variable is marked "nuw".  Two, the loop is guarded by an "icmp ult" which restricts the initial value of the induction variable.  No patch can work correctly without using one of those bits of information.

Comment 10 Florian Hahn 2019-09-03 15:33:27 PDT

I experimented with folding the information from the preheader into the expression to compute the max backedge taken count today. I’ll try to share a patch tomorrow.

Comment 11 Vivek Pandya 2019-09-04 04:39:36 PDT

(In reply to Eli Friedman from comment #9)
> That patch is taking the wrong approach.
> 
> The IR should look something like this:
> 
> define void @topup(i32* nocapture %a, i64 %i) {
> entry:
>   %cmp3 = icmp ult i64 %i, 16
>   br i1 %cmp3, label %for.body, label %for.end
> 
> for.body:
>   %i.addr.04 = phi i64 [ %inc, %for.body ], [ %i, %entry ]
>   %arrayidx = getelementptr inbounds i32, i32* %a, i64 %i.addr.04
>   store i32 1, i32* %arrayidx, align 4, !tbaa !2
>   %inc = add nuw nsw i64 %i.addr.04, 1
>   %exitcond = icmp eq i64 %inc, 16
>   br i1 %exitcond, label %for.end, label %for.body
> 
> for.end:
>   ret void
> }
> 
> 
> If you look at the IR, you have a loop which is roughly of the form "for (;
> i != 16; ++i)".  In general, for loops of that form, the backedge-taken
> count is unrestricted: unsigned arithmetic wraps.  (For example, if i is
> -10, the loop runs 26 times.)
> 
> There are two factors you could use to prove the max backedge-taken count. 
> One, the induction variable is marked "nuw".  Two, the loop is guarded by an
> "icmp ult" which restricts the initial value of the induction variable.  No
> patch can work correctly without using one of those bits of information.

Here i is declared as unsigned so max back-edge taken count should be 16 because for any value greater than 16 loops should not get executed and there will not be unsigned wrapping of i. So I think without "nuw" based on the loop guard we should be able to figure out max back edge taken count.

Comment 12 Florian Hahn 2019-09-04 08:19:38 PDT

I think we could achieve the desired result by folding information from the loop guards into the expression we use to compute the max backedge taken count. I've put up a initial patch, which just deals with getting a more precise max count https://reviews.llvm.org/D67178

Comment 13 Eli Friedman 2019-09-04 12:18:12 PDT

> So I think without "nuw" based on the loop guard we should be able to figure < > out max back edge taken count.

I thought that's what I said; you need either the nuw or the loop guard (but not both).

Comment 14 Florian Hahn 2020-10-08 05:22:58 PDT

Just a quick update. The loop vectorizer now only uses a vectorization factor of 8 for the example, but the unroller unrolls with vector body 8 times later on

https://godbolt.org/z/MdWWbd