
Clang really shouldn't have ASM tests in it #27189

Closed

rengolin opened this issue Mar 2, 2016 · 19 comments

Labels: bugzilla (Issues migrated from bugzilla), clang (Clang issues not falling into any other category)

@rengolin (Member) commented Mar 2, 2016

Bugzilla Link: 26815
Resolution: FIXED
Resolved on: Mar 09, 2016 12:56
Version: unspecified
OS: Linux
CC: @ahmedbougacha, @compnerd, @echristo, @jmolloy, @kbeyls, @RKSimon, @pogo59, @rengolin, @rotateright, @TNorthover, @vedantk

Extended Description

After enough discussion on the mailing list, and what I believe is now the consensus, we should move all ASM tests to LLVM and transform all Clang tests into IR tests.

The few cases that people were arguing about:

  1. NEON intrinsics need to lower to NEON instructions one-to-one.

That's not true. I love that the back-end combines vmul+vadd into vmla, and it should keep doing it. Testing for a sequence of IR instructions (or a builtin) instead is no worse; see the first sketch below.

  2. Inline ASM needs to appear as-is in the asm output.

No it doesn't. There are a number of cases where we (and GAS) do slight transformations on the asm and output different code (e.g. add r0, -1 -> sub r0, 1). Furthermore, inline asm has a nice IR representation, which we can also test; see the second sketch below.

  3. Specific code-gen issues.

If C code gets translated into IR and then assembly, we can obviously always find IR that will produce the same assembly. To capture it, simply use -emit-llvm; see the third sketch below.
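To make those three cases concrete, here are minimal sketches of what the replacement tests could look like. All function names, triples and feature flags below are illustrative, not taken from the actual tests, and the exact IR depends on the Clang/LLVM revision.

Cases 1 and 2 as a Clang IR test (checking semantics, not instructions):

// RUN: %clang_cc1 -triple armv7-linux-gnueabihf -target-feature +neon \
// RUN:   -emit-llvm -o - %s | FileCheck %s
#include <arm_neon.h>

// Case 1: check the semantics as IR; the back-end stays free to form vmla.
float32x2_t mla(float32x2_t acc, float32x2_t l, float32x2_t r) {
  return vadd_f32(acc, vmul_f32(l, r));
}
// CHECK-LABEL: @mla
// CHECK: fmul <2 x float>
// CHECK: fadd <2 x float>

// Case 2: inline asm has a first-class IR representation we can check.
int one(void) {
  int r;
  __asm__("mov %0, #1" : "=r"(r));
  return r;
}
// CHECK-LABEL: @one
// CHECK: call i32 asm "mov $0, #1", "=r"()

Case 3 then becomes an llc companion test in LLVM, fed the captured IR (whether vmla is actually formed depends on the subtarget, so the CHECK here is only indicative):

; RUN: llc -mtriple=armv7-linux-gnueabihf -mattr=+neon %s -o - | FileCheck %s
define <2 x float> @mla(<2 x float> %acc, <2 x float> %l, <2 x float> %r) {
  %mul = fmul <2 x float> %l, %r
  %add = fadd <2 x float> %acc, %mul
  ret <2 x float> %add
}
; CHECK: vmla.f32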

--

Having said that, the move itself will be very tedious, but it is necessary.

Given that most people were complaining about the ARM and AArch64 tests, I think it's only fair that we, ARM folks, share the load of moving things. I'll copy as many people as I can into this bug so we can coordinate our efforts, but I'd like to have a much cleaner Clang test suite by 3.9.0.

@rotateright (Contributor)

  1. NEON intrinsics need to lower to NEON instructions one-to-one.

We had a variation on this argument for x86's obscene pile of SSE/AVX intrinsics: in unoptimized code, we thought it was important that the one-to-one intrinsic -> asm mapping should be true - because debugging those things is tough enough already. This is bug 24580.

However, after:
http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20151123/143974.html

...we split those up. I don't think we ever answered whether there was a designated place for end-to-end tests of this sort, but based on this recent llvm-dev thread:
http://lists.llvm.org/pipermail/llvm-dev/2016-February/095275.html

maybe test-suite is the right spot?

@rengolin (Member, Author) commented Mar 2, 2016

We had a variation on this argument for x86's obscene pile of SSE/AVX
intrinsics: in unoptimized code, we thought it was important that the
one-to-one intrinsic -> asm mapping should be true - because debugging those
things is tough enough already. This is bug 24580.

I think this goal is counter-productive.

Imagine the case where the intrinsic is lowered to IR instructions, not builtins, and O0 generates so much wrapper code that the patterns don't match in the back-end.

You have a few not-ideal options:

  1. Fudge the match to recognise more variations, which could reduce the certainty of the semantics.

  2. Change Clang to reduce the wrapper code for intrinsics-to-IR, which would create special cases and increase maintenance.

  3. Force Clang to emit intrinsics, stopping optimisations down the pipeline.

All for the sake of an identity that is not really necessary.

If, for any target, this identity is mandated by the ABI, then Clang must generate builtins for ALL intrinsics, and it's up to the back-end (or a target-specific middle-end pass) to convert them to IR early in the O1+ pipeline.

In that case, just checking that the IR has the builtin you want is then enough for a Clang test.

@rotateright (Contributor)

  3. Force Clang to emit intrinsics, stopping optimisations down the pipeline.

All for the sake of an identity that is not really necessary.

For x86 at least, I don't think there's any concern about preventing optimizations.

We've mostly settled on doing these transforms in InstCombineCalls.cpp. That lets us write readable/debuggable C++ code to do the transform, ensures that we're doing the transform to IR very early, and doesn't interfere with the programmer's intrinsic-based intent when building at -O0.
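A sketch of how that split could be tested (triple, feature flags and check prefixes are illustrative, and the -O1 line assumes InstCombine's constant-shift fold, per the D10555 work linked at the end of this comment):

// RUN: %clang_cc1 -triple x86_64-linux-gnu -target-feature +sse2 \
// RUN:   -emit-llvm -o - %s | FileCheck %s --check-prefix=O0
// RUN: %clang_cc1 -triple x86_64-linux-gnu -target-feature +sse2 \
// RUN:   -O1 -emit-llvm -o - %s | FileCheck %s --check-prefix=O1
#include <emmintrin.h>
__m128i shift3(__m128i a) { return _mm_slli_epi32(a, 3); }
// At -O0 the programmer's intrinsic survives verbatim:
// O0: call <4 x i32> @llvm.x86.sse2.pslli.d
// Once InstCombine runs, it becomes a generic IR shift:
// O1: shl <4 x i32> {{.*}}, <i32 3, i32 3, i32 3, i32 3>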

There are cases where we transform to native IR instructions in the header files. In addition to the vector programmer's debugging concerns, consider that we have macros in headers that look like this:

/* Vector shuffle */
#define _mm256_shuffle_ps(a, b, mask) __extension__ ({ \
        (__m256)__builtin_shufflevector((__v8sf)(__m256)(a), \
                                        (__v8sf)(__m256)(b), \
                                        (mask) & 0x3, \
                                        ((mask) & 0xc) >> 2, \
                                        (((mask) & 0x30) >> 4) + 8, \
                                        (((mask) & 0xc0) >> 6) + 8, \
                                        ((mask) & 0x3) + 4, \
                                        (((mask) & 0xc) >> 2) + 4, \
                                        (((mask) & 0x30) >> 4) + 12, \
                                        (((mask) & 0xc0) >> 6) + 12); })

Granted, not having an x86-specific builtin for this case is nice... the proliferation of x86 intrinsics is unstoppable. Saner instruction set architectures may not have this problem.

Some more details about the optimization options for intrinsics are here:
http://reviews.llvm.org/D10555

@TNorthover (Contributor)

The complaint isn't really about asm tests, but about running LLVM optimisations. And unfortunately, I think that's the more burdensome one to fix. The IR we produce for

float32x2_t res(float32x2_t l, float32x2_t r) {
  return vadd_f32(l, r);
}

is:

define <2 x float> @res(<2 x float> %l, <2 x float> %r) #0 {
entry:
  %__p0.addr.i = alloca <2 x float>, align 8
  %__p1.addr.i = alloca <2 x float>, align 8
  %__ret.i = alloca <2 x float>, align 8
  %l.addr = alloca <2 x float>, align 8
  %r.addr = alloca <2 x float>, align 8
  store <2 x float> %l, <2 x float>* %l.addr, align 8
  store <2 x float> %r, <2 x float>* %r.addr, align 8
  %0 = load <2 x float>, <2 x float>* %l.addr, align 8
  %1 = load <2 x float>, <2 x float>* %r.addr, align 8
  store <2 x float> %0, <2 x float>* %__p0.addr.i, align 8
  store <2 x float> %1, <2 x float>* %__p1.addr.i, align 8
  %2 = load <2 x float>, <2 x float>* %__p0.addr.i, align 8
  %3 = load <2 x float>, <2 x float>* %__p1.addr.i, align 8
  %add.i = fadd <2 x float> %2, %3
  store <2 x float> %add.i, <2 x float>* %__ret.i, align 8
  %4 = load <2 x float>, <2 x float>* %__ret.i, align 8
  ret <2 x float> %4
}

Testing that code, or even working out whether it's correct by eye, is nontrivial.

On the other hand, breaking clang tests with an LLVM change can't be fun. I wonder if just running "opt -mem2reg" would be an acceptable compromise: I'd expect it to clean up the generated NEON code massively and it's a very stable pass with a well-defined job.
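For reference, a minimal sketch of that compromise (triple and flags are illustrative; opt -mem2reg is the pass named above):

// RUN: %clang_cc1 -triple arm64-linux-gnu -target-feature +neon -emit-llvm -o - %s \
// RUN:   | opt -mem2reg -S | FileCheck %s
#include <arm_neon.h>
float32x2_t res(float32x2_t l, float32x2_t r) {
  return vadd_f32(l, r);
}
// With the allocas gone, the body collapses to the interesting part:
// CHECK-LABEL: @res
// CHECK: [[SUM:%.*]] = fadd <2 x float> %l, %r
// CHECK: ret <2 x float> [[SUM]]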

@TNorthover (Contributor)

Given how many tests we do now, I'll see if I can whip up a script to auto-convert the existing ones (i.e. replace the asm checks with the IR clang actually produces, using FileCheck [[VARS]]: not ideal, but certainly no worse than what we've got now given that it already passes).

Other than that, a related question we might want to consider is just how thorough we want these NEON tests to be. I suspect they currently have large but not complete coverage of arm_neon.h, and they take a fairly ridiculous amount of time to execute.

Would a more orthogonal subset be better, especially now that we have the emperor tests in the test-suite? Not sure. And not sure how we'd decide which variants are worth testing either.

We can probably delay answering those questions to later though.

@TNorthover (Contributor)

Oh, and I also thoroughly agree with Renato that we make absolutely no guarantees about intrinsics mapping to instructions in the ARM world. arm_neon.h describes semantics, not the assembly code output.

@rengolin (Member, Author) commented Mar 3, 2016

There's also the question of how much we can really do on LLVM's side. Whenever Clang changes to support new intrinsics, so does LLVM.

Meaning that the same Clang+LLVM revision is known to work and emit the expected assembly, but not necessarily an old Clang (or an LLVM test produced with an old Clang) against a new LLVM revision.

By removing the tests from Clang, we may inadvertently cause the same effect on LLVM's side, by having old IR patterns not match because of some new optimisation pass.

In order to solve this in a reasonable fashion, the best approach is to auto-generate LLVM tests from Clang's "current" output. This can be done by taking the original C code, lowering it to IR, and concatenating some RUN/CHECK lines at the beginning/end.

Even though this is only valid for a small class of problems, it's probably the class that will generate the most grief in LLVM testing. The rest will hopefully be more stable.

I recommend that the first step be to duplicate the tests on the LLVM side and see how they end up. Once we're happy with the results, we start removing the tests on Clang's side, and we'll be left with only the hairy cases in Clang.

@llvmbot (Collaborator) commented Mar 3, 2016

Given how many tests we do now, I'll see if I can whip up a script to auto-convert the existing ones (i.e. replace the asm checks with the IR clang actually produces, using FileCheck [[VARS]]: not ideal, but certainly no worse than what we've got now given that it already passes).

Half-baked idea: this seems like something that'd be more generally useful than a one-off script. Maybe it's worth teaching FileCheck more about llvm's textual IR to make it do this transformation semi-automagically (under some new flag, of course):

// RUN: %cc1 %s -emit-llvm -o - | FileCheck --auto-ir-vars %s

float32x2_t res(float32x2_t l, float32x2_t r) {
  return vadd_f32(l, r);
}

// CHECK-LABEL: define <2 x float> @res(<2 x float> %l, <2 x float> %r) #0 {
// CHECK-NEXT: entry:
// CHECK-DAG: %__p0.addr.i = alloca <2 x float>, align 8
// CHECK-DAG: %__p1.addr.i = alloca <2 x float>, align 8
// CHECK-DAG: %__ret.i = alloca <2 x float>, align 8

</hand waving>

such that the %vars get automatically turned into the appropriate name-agnostic [[VAR:%[a-zA-Z_][a-zA-Z_0-9]+]] thing for defs, and [[VAR]] for uses.
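i.e., hand-waving a bit further, the auto-generated checks for the body above might come out like this (regexes illustrative):

// CHECK: [[ADD:%[a-zA-Z_][a-zA-Z_0-9.]*]] = fadd <2 x float> [[A:%[0-9]+]], [[B:%[0-9]+]]
// CHECK-NEXT: store <2 x float> [[ADD]], <2 x float>* [[RET:%[a-zA-Z_][a-zA-Z_0-9.]*]]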

@rotateright (Contributor)

The complaint isn't really about asm tests, but about running LLVM
optimisations. And unfortunately, I think that's the more burdensome one to
fix. The IR we produce for

float32x2_t res(float32x2_t l, float32x2_t r) {
  return vadd_f32(l, r);
}

is:

define <2 x float> @res(<2 x float> %l, <2 x float> %r) #0 {
entry:
  %__p0.addr.i = alloca <2 x float>, align 8
  %__p1.addr.i = alloca <2 x float>, align 8
  %__ret.i = alloca <2 x float>, align 8
  %l.addr = alloca <2 x float>, align 8
  %r.addr = alloca <2 x float>, align 8
  store <2 x float> %l, <2 x float>* %l.addr, align 8
  store <2 x float> %r, <2 x float>* %r.addr, align 8
  %0 = load <2 x float>, <2 x float>* %l.addr, align 8
  %1 = load <2 x float>, <2 x float>* %r.addr, align 8
  store <2 x float> %0, <2 x float>* %__p0.addr.i, align 8
  store <2 x float> %1, <2 x float>* %__p1.addr.i, align 8
  %2 = load <2 x float>, <2 x float>* %__p0.addr.i, align 8
  %3 = load <2 x float>, <2 x float>* %__p1.addr.i, align 8
  %add.i = fadd <2 x float> %2, %3
  store <2 x float> %add.i, <2 x float>* %__ret.i, align 8
  %4 = load <2 x float>, <2 x float>* %__ret.i, align 8
  ret <2 x float> %4
}

Testing that code, or even working out whether it's correct by eye, is
nontrivial.

On the other hand, breaking clang tests with an LLVM change can't be fun. I
wonder if just running "opt -mem2reg" would be an acceptable compromise: I'd
expect it to clean up the generated NEON code massively and it's a very
stable pass with a well-defined job.

Is it necessary to have complete checking for this kind of case? I think we just need one label and one check line to know that clang has done the intended job of translating the intrinsic to IR:

; CHECK-LABEL: define <2 x float> @res(<2 x float> %l, <2 x float> %r)
; CHECK: fadd <2 x float>

We can assume that the rest of the IR is correct and covered by tests elsewhere? (Surprisingly, I don't see any x86 equivalents for these kinds of tests...)

@rengolin (Member, Author) commented Mar 3, 2016

Is it necessary to have complete checking for this kind of case? I think we
just need one label and one check line to know that clang has done the
intended job of translating the intrinsic to IR:

; CHECK-LABEL: define <2 x float> @res(<2 x float> %l, <2 x float> %r)
; CHECK: fadd <2 x float>

Unfortunately, that doesn't cover everything.

For trivial semantics (vadd_f32 -> fadd) that should be enough, but others, like zip, vldN and the ones that rely on shuffles, casts, promotions and truncations, will all need a strict sequence of IR instructions with the correct dependencies between them.

Having CHECK lines for just the sequence of instructions will be "enough" for most purposes, though, so we may only need full-blown variable usage for the few complex cases...
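As a sketch of what one of the hairier ones (vzip) needs, the masks and exact lowering below are from memory and should be treated as illustrative:

// RUN: %clang_cc1 -triple armv7-linux-gnueabihf -target-feature +neon \
// RUN:   -emit-llvm -o - %s | FileCheck %s
#include <arm_neon.h>
int32x2x2_t zip(int32x2_t a, int32x2_t b) { return vzip_s32(a, b); }
// Both shuffles must see the same two inputs, paired the same way round:
// CHECK: shufflevector <2 x i32> [[A:%.*]], <2 x i32> [[B:%.*]], <2 x i32> <i32 0, i32 2>
// CHECK: shufflevector <2 x i32> [[A]], <2 x i32> [[B]], <2 x i32> <i32 1, i32 3>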

@rotateright (Contributor)

For trivial semantics (vadd_f32 -> fadd) that should be enough, but others,
like zip, vldN and the ones that rely on shuffles, casts, promotions and
truncations, will all need a strict sequence of IR instructions with the
correct dependencies between them.

Having CHECK lines for just the sequence of instructions will be "enough"
for most purposes, though, so we may only need full-blown variable usage for
the few complex cases...

Agreed. It looks like Altivec is tested quite efficiently this way:
clang/test/CodeGen/builtins-ppc-altivec.c

@ahmedbougacha (Member)

What's wrong with a very strict matching of the entire (unoptimized) IR sequence (à la update_llc_test_checks.py)?

We assume they're correct today so in theory we don't even need to look at the current IR.

We don't expect these to change often, and when they do, updating the output and git-diff ought to be enough. I think I'd want to know if some unrelated IRGen commit changed the output (it's not unrelated anymore, is it?)

Yes, it's some extra work when something does change. But if we provide an easy one-line script to update the checks, the change author can either ignore it (the current status quo), or make sure they didn't break anything (strictly better than the status quo).

@TNorthover (Contributor)

I'd much prefer to check dataflow in all cases. It's very easy to end up with "fadd %a, %a" or get the subtraction the wrong way round (we did that originally, in fact, and it took us 2 years to fix properly because people had started relying on the obviously-wrong operand order).
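A minimal sketch of a dataflow check that catches exactly that (the %__p0.addr.i names are taken from the -O0 IR quoted above; flags are illustrative):

// RUN: %clang_cc1 -triple armv7-linux-gnueabihf -target-feature +neon \
// RUN:   -emit-llvm -o - %s | FileCheck %s
#include <arm_neon.h>
float32x2_t sub(float32x2_t l, float32x2_t r) { return vsub_f32(l, r); }
// A bare "CHECK: fsub" would still pass with the operands swapped; pinning
// each operand to the slot it was loaded from would not:
// CHECK-LABEL: @sub
// CHECK: [[A:%.*]] = load <2 x float>, <2 x float>* %__p0.addr.i
// CHECK: [[B:%.*]] = load <2 x float>, <2 x float>* %__p1.addr.i
// CHECK: fsub <2 x float> [[A]], [[B]]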

@TNorthover (Contributor)

I still think having readable tests is more important than wanting to completely isolate Clang from LLVM's mid-end.

@rotateright (Contributor)

What's wrong with a very strict matching of the entire (unoptimized) IR
sequence (à la update_llc_test_checks.py)?

We assume they're correct today so in theory we don't even need to look at
the current IR.

We don't expect these to change often, and when they do, updating the output
and git-diff ought to be enough. I think I'd want to know if some unrelated
IRGen commit changed the output (it's not unrelated anymore, is it?)

Yes, it's some extra work when something does change. But if we provide an
easy one-line script to update the checks, the change author can either
ignore it (the current status quo), or make sure they didn't break anything
(strictly better than the status quo).

Tim mentioned in comment 5 that these tests may be taking more time than they're worth. But an "update_clang_test_checks.py" and "update_opt_test_checks.py" would be extremely useful in general, and hopefully that's where we're headed. :)

@rengolin (Member, Author) commented Mar 3, 2016

I'd much prefer to check dataflow in all cases. It's very easy to end up
with "fadd %a, %a" or get the subtraction the wrong way round (we did that
originally, in fact, and it took us 2 years to fix properly because people
had started relying on the obviously-wrong operand order).

I'm not against it, just trying to be pragmatic.

We don't have perfect tests today, so we don't need perfect tests to replace them. But if people are happy to make them better at the same time as moving, I'm game.

@compnerd (Member) commented Mar 3, 2016

I still think having readable tests is more important than wanting to
completely isolate Clang from LLVM's mid-end.

I fully agree with you here. The thing is that builtins are slightly more involved for this exact reason: they expect to be able to bore through the frontend and hook into the mid or back ends. I believe a similar argument has been made in favor of intrinsics over builtins in the past. As already pointed out (by you!), the dataflow analysis here is relevant and important, since it is possible to break things subtly otherwise. I think that the readability of the tests directly impacts that.

@TNorthover (Contributor)

I've put a patch up for review at http://reviews.llvm.org/D17999.

@TNorthover (Contributor)

I think this has been addressed by r263048.

llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
This issue was closed.