# An approach to sustainably add Vector Predication to the Loop Vectorizer



Lorenzo Albano <lorenzo.albano@bsc.es>, Roger Ferrer <roger.ferrer@bsc.es>

## Background

The Loop Vectorizer does not make use of Vector Predication intrinsics. We have been extending the Loop Vectorizer with new Vector Planner recipes that are vector-length and mask aware. This approach works but leads to recipe duplication in the Loop Vectorizer.

We wanted to see if a simpler approach is workable.

## Method

We implemented a new tail-folding mode in the Loop Vectorizer that emits only the minimum set of Vector Predication intrinsics needed for correctness: memory accesses and a (target-dependent) set-vector-length mechanism that depends on the number of remaining loop iterations.
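
As a rough illustration of this tail-folding mode, the C sketch below strip-mines the `add_ref` loop shown on this poster. The names `set_vector_length`, `add_evl` and the `vlmax` parameter are hypothetical, and the sketch assumes the set-vector-length mechanism grants `min(remaining, vlmax)` lanes, which is a simplification: a real target may pick a different legal value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the target-dependent "set vector length"
   mechanism. Assumption: it grants min(remaining, vlmax) lanes. */
static size_t set_vector_length(size_t remaining, size_t vlmax) {
    return remaining < vlmax ? remaining : vlmax;
}

/* c[i] = a[i] + b[i] with the tail folded into the vector loop.
   Returns the number of vector iterations executed. */
size_t add_evl(size_t n, double *c, const double *a, const double *b,
               size_t vlmax) {
    assert(vlmax > 0);
    size_t trips = 0;
    for (size_t i = 0; i < n; ++trips) {
        /* The vector length depends on the remaining iterations. */
        size_t evl = set_vector_length(n - i, vlmax);
        /* Models vp.load, fadd and vp.store acting on lanes [0, evl). */
        for (size_t lane = 0; lane < evl; ++lane)
            c[i + lane] = a[i + lane] + b[i + lane];
        i += evl;
    }
    return trips;
}
```

With `n = 10` and `vlmax = 4`, the loop runs three times with vector lengths 4, 4 and 2, so no scalar remainder loop is needed.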

A later pass analyses what is demanded from the vectors, starting from Vector Predication stores. From this analysis, vectorized IR instructions (when possible) are replaced with Vector Predication intrinsics that use the demanded vector length and mask.
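
The idea of this demanded-lanes analysis can be sketched in C on a toy straight-line block. Everything here is illustrative rather than the actual pass: the `Inst` layout, `analyze_demand` and `convertible` are hypothetical names, masks and vector lengths are plain integer ids, and the policy of leaving an instruction unpredicated when its users disagree is an assumption about what "when possible" could mean.

```c
#include <assert.h>
#include <stddef.h>

#define NONE     (-1)  /* no operand, or nothing demanded yet */
#define CONFLICT (-2)  /* users disagree: keep the unpredicated op */

typedef struct {
    int is_vp_store;  /* 1 for vp.store, 0 for an ordinary vector op */
    int ops[2];       /* indices of the defining instructions, or NONE */
    int mask;         /* vp.store: its mask id; op: filled in by the analysis */
    int evl;          /* vp.store: its EVL id;  op: filled in by the analysis */
} Inst;

/* Merge one demanded (mask, evl) pair into the value defined by d. */
static void demand(Inst *d, int mask, int evl) {
    if (d->is_vp_store) return;  /* stores define no value */
    if (d->mask == NONE) { d->mask = mask; d->evl = evl; }
    else if (d->mask != mask || d->evl != evl) d->mask = d->evl = CONFLICT;
}

/* bb holds n instructions in program order (definitions before uses).
   One reverse walk visits every user before the values it consumes. */
void analyze_demand(Inst *bb, size_t n) {
    for (size_t i = 0; i < n; ++i)
        if (!bb[i].is_vp_store) bb[i].mask = bb[i].evl = NONE;
    for (size_t i = n; i-- > 0; ) {
        if (bb[i].mask == NONE) continue;  /* nothing demands this op */
        for (int k = 0; k < 2; ++k)
            if (bb[i].ops[k] != NONE)
                demand(&bb[bb[i].ops[k]], bb[i].mask, bb[i].evl);
    }
}

/* An op can become a VP intrinsic if all its users agree on (mask, evl). */
int convertible(const Inst *in) {
    return !in->is_vp_store && in->mask >= 0;
}
```

Applied to a block with the shape of the one on this poster (`%W`, `%V`, `%Y = op %W, %V`, `%X`, then two stores), the walk assigns the first store's mask to `%Y`, `%W` and `%V`, and the second store's mask to `%X`.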

## Results

We used the TSVC-2 benchmark on the RISC-V target and compared the emitted code against our earlier, more invasive implementation. The emitted code is comparable.

## Conclusion

This approach requires a small set of changes to the Loop Vectorizer while allowing us to reason about vector length and predicate in another pass.

This approach is low cost and benefits RISC-V and VE targets, which have a vector length, as well as SVE and AVX-512 targets, which can now make more effective use of their predicated ISAs.



#### Basic Block from Loop Vectorizer

...

%evl = rvv.vsetvl( ... )

%W = op <vscale x k x ty> ...

%V = op <vscale x k x ty> ...

%Y = op <vscale x k x ty> %W, %V

%X = op <vscale x k x ty> ...

%maskA = ...

%maskB = ...

vp.store(%Y, %addrY, %maskA, %evl)

vp.store(%X, %addrX, %maskB, %evl)

#### Emit Basic Block with VP instructions

%evl = rvv.vsetvl( ... )

%maskA = ...

%maskB = ...

%vp.W = vp.op <vscale x k x ty> ..., %maskA, %evl

%vp.V = vp.op <vscale x k x ty> ..., %maskA, %evl

%vp.Y = vp.op <vscale x k x ty> %vp.W, %vp.V, %maskA, %evl

%vp.X = vp.op <vscale x k x ty> ..., %maskB, %evl

...

vp.store(%vp.Y, %addrY, %maskA, %evl)

vp.store(%vp.X, %addrX, %maskB, %evl)

Using Vector Predication does not mean we have to duplicate all the concepts in the Loop Vectorizer.












[Left panel ("... changes"): VPlan listing, truncated]

#### Upstream tail folding (for comparison)

<pre>void add_ref(long N, double *c, double *a, double *b) {
    long i;
    for (i = 0; i &lt; N; i++)
        c[i] = a[i] + b[i];
}</pre>

<pre>vector.ph:
Successor(s): vector loop

&lt;x1&gt; vector loop
vector.body:
  EMIT vp&lt;%2&gt; = CANONICAL-INDUCTION
  vp&lt;%3&gt; = SCALAR-STEPS vp&lt;%2&gt;, ir&lt;1&gt;
  EMIT vp&lt;%4&gt; = active lane mask vp&lt;%3&gt; vp&lt;%1&gt;
  CLONE ir&lt;%arrayidx&gt; = getelementptr ir&lt;%a&gt;, vp&lt;%3&gt;
  WIDEN ir&lt;%0&gt; = load ir&lt;%arrayidx&gt;, vp&lt;%4&gt;
  CLONE ir&lt;%arrayidx1&gt; = getelementptr ir&lt;%b&gt;, vp&lt;%3&gt;
  WIDEN ir&lt;%1&gt; = load ir&lt;%arrayidx1&gt;, vp&lt;%4&gt;
  WIDEN ir&lt;%add&gt; = fadd ir&lt;%0&gt;, ir&lt;%1&gt;
  CLONE ir&lt;%arrayidx2&gt; = getelementptr ir&lt;%c&gt;, vp&lt;%3&gt;
  WIDEN store ir&lt;%arrayidx2&gt;, ir&lt;%add&gt;, vp&lt;%4&gt;
  EMIT vp&lt;%11&gt; = VF * UF + vp&lt;%2&gt;
  EMIT branch-on-count vp&lt;%11&gt; vp&lt;%0&gt;
No successors

middle.block:</pre>

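For contrast with a vector-length-based loop, the upstream plan above folds the tail with an active lane mask. The C sketch below models that scheme under stated assumptions: `VF`, `lane_active` and `add_masked` are hypothetical names, the vectorization factor is fixed at 4 for illustration, and a lane is taken to be active when `base + lane < n`, which is how `llvm.get.active.lane.mask` is defined.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4  /* illustrative fixed vectorization factor (an assumption) */

/* Lane `lane` is active iff base + lane < n. The caller guarantees
   base < n, so the subtraction cannot wrap around. */
static int lane_active(size_t base, size_t lane, size_t n) {
    return lane < n - base;
}

/* c[i] = a[i] + b[i]: every iteration processes VF lanes, and the
   mask switches off the lanes that fall past the end of the arrays.
   Returns the number of vector iterations executed. */
size_t add_masked(size_t n, double *c, const double *a, const double *b) {
    size_t trips = 0;
    for (size_t i = 0; i < n; i += VF, ++trips)
        for (size_t lane = 0; lane < VF; ++lane)
            if (lane_active(i, lane, n))  /* masked load / fadd / store */
                c[i + lane] = a[i + lane] + b[i + lane];
    return trips;
}
```

Here every iteration is full width and only the mask changes in the last one, whereas a vector-length-based loop shortens the final iteration instead.
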
#### Comparisons with TSVC-2

#### TSVC-2 vtvtv

| Previous Approach                    | Current Approach                        |
|--------------------------------------|-----------------------------------------|
| li a0, 0                             | mv a1, s0                               |
| .LBB152_2:                           | .LBB152_2:                              |
| sub a1, s0, a0                       | slli a1, a1, 32                         |
| vsetvli a1, a1, e32, m1, ta, mu      | srli a1, a1, 32                         |
| slli a2, a0, 2                       | vsetvli a1, a1, e64, m2, ta, mu         |
| add a3, s3, a2                       | slli a2, a0, 2                          |
| vle32.v v8, (a3)                     | add a3, s3, a2                          |
| add a4, s10, a2                      | vle32.v v8, (a3)                        |
| vle32.v v9, (a4)                     | add a4, s10, a2                         |
| add a2, a2, s1                       | vle32.v v9, (a4)                        |
| vle32.v v10, (a2)                    | vsetvli zero, a1, e32, m1, ta, ma       |
| vfmul.vv v8, v8, v9                  | add a2, a2, s1                          |
| vfmul.vv v8, v8, v10                 | vle32.v v10, (a2)                       |
| add a0, a0, a1                       | vfmul.vv v8, v8, v9                     |
| vse32.v v8, (a3)                     | vfmul.vv v8, v8, v10                    |
| bne a0, s0, .LBB152_2                | vse32.v v8, (a3)                        |
| # %bb.3:                             | add a0, a0, a1                          |
|                                      | sub a1, s0, a0                          |
| mv a0, s3                            | bne a0, s0, .LBB152_2                   |


#### TSVC-2 Test Loop 274

| Previous Approach            | Current Approach               |
|------------------------------|--------------------------------|
| .LBB71_2:                    | mv a1, s0                      |
|                              | .LBB71_2:                      |
| sub a1, s0, a...             |                                |
| ... a1, e64, m2, ta, ma      | ... a1, 32                     |
| slli a2, a...                | srli a1...                     |
| add a3, s3, a...             | ... a1, e64, m2, ta, mu        |
| vle32.v v8, (...             | slli a2...                     |
| add a3, s4, a...             | add a3, s3...                  |
| vle32.v v9, (...             | vle32.v v9...                  |
| add a3, s5, a...             | add a3, s4...                  |
| vle32.v v10, (a...           | vle32.v v1...                  |
| vid.v v12                    | add a3, s5...                  |
| ... v11, v12, s11            | vle32.v v1...                  |
| ..., zero, e32, m1, ta, ma   | ...3, zero, e32, m1, ta, ma    |
| vfmacc.vv v...               | ... v9, v10, v11               |
| add a3, s10, ...             | add a3, s1...                  |
| vmfgt.vf v...                | ...ero, a1, e32, m1, ta, ma    |
| ... v0, v11, v12             | ... v8, v9, fs0                |
| vse32.v v8, (...             | vmnot.m v0...                  |
| ... v9, v9, v10, v0.t        | vse32.v v9...                  |
| vse32.v v9, (...             | ... v10, v10, v11, v0.t        |
| vmand.mm v...                | ...0, (a3), v0.t               |
| add a2, a2, s...             | add a2, a2...                  |
| vle32.v v9, (...             | vmv.v.v v0...                  |
| ... v8, v8, v9, v0.t         | ...0, (a2), v0.t               |
| add a0, a0, a...             | ... v9, v9, v10, v0.t          |
| vse32.v v8, (a2), v0.t       | vse32.v v9, (a2), v0.t         |
| bne a0, s0, .LBB71_2         | add a0, a0, a1                 |
| # %bb.3:                     | sub a1, s0, a0                 |
|                              | bne a0, s0, .LBB71_2           |

This project has received funding from the European High Performance Computing Joint Undertaking (JU) under Framework Partnership Agreement No 800928 and Specific Grant Agreement No 101036168 (EPI SGA2). The JU receives support from the European Union's Horizon 2020 research and innovation programme and from Croatia, France, Germany, Greece, Italy, Netherlands, Portugal, Spain, Sweden, and Switzerland.

The EPI-SGA2 project, PCI2022-132935, is also co-funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.



Barcelona Supercomputing Center (Centro Nacional de Supercomputación)