crash with openGL function glRasterPos2i() when using r600 driver if mesa is built with llvm 3.7 libs #25395

Open
llvmbot opened this issue Oct 2, 2015 · 17 comments
Labels
backend:X86 bugzilla Issues migrated from bugzilla

Comments

@llvmbot
Collaborator

llvmbot commented Oct 2, 2015

Bugzilla Link 25021
Version 3.7
OS Linux
Reporter LLVM Bugzilla Contributor
CC @topperc,@foutrelis,@rnk,@rotateright

Extended Description

I use Arch Linux (64-bit) with an AMD Radeon HD 4650 PCIe graphics card and an Intel Pentium Dual-Core E6800 CPU at 3.33 GHz.

When Mesa 11.0.2 is built with LLVM 3.7.0, the r600 driver crashes whenever a program calls the OpenGL function glRasterPos2i(). For example, the "tunnel" test program from the mesa-demos package crashes with the error "illegal instruction".

FlightGear 3.4 also crashes at startup (because it uses glRasterPos2i()).

The workaround is to build Mesa 11.0.2 with the previous version of LLVM (3.6.2): the bug does not occur when LLVM 3.6.2 and llvm-libs 3.6.2 are used for the build.

I first filed a bug report on the Mesa bug tracker:

https://bugs.freedesktop.org/show_bug.cgi?id=92214

I thought Mesa 11.0.2 was the culprit, but the real culprit is LLVM 3.7.0.

Something is wrong in LLVM 3.7.0.

I also use gcc-multilib 5.2.0-2 and glibc 2.22-3; maybe the combination of glibc 2.22-3 and LLVM 3.7.0 is the problem.

@llvmbot
Collaborator Author

llvmbot commented Oct 7, 2015

I tried the svn version of LLVM (revision 249579) and the bug is still there.

I will try to find the svn revision that introduced the bug.

@llvmbot
Collaborator Author

llvmbot commented Oct 13, 2015

I did not manage to find the faulty commit: bisecting is very slow on my PC (my CPU is not fast) and there are a lot of svn revisions to test.

But I made an interesting discovery: the bug also occurs in a virtual machine running on the same PC (QEMU i686; guest OS: Arch Linux i686; host OS: Arch Linux 64-bit; CPU: Pentium Dual-Core E6800).

In this virtual machine the r600 driver is not used; rendering goes through swrast_dri.so (100% software emulation, no 3D acceleration).

In this virtual machine all OpenGL programs (glxgears, for example) crash with the error "illegal instruction".

glxinfo output for this QEMU VM:

name of display: :0
display: :0 screen: 0
direct rendering: Yes
server glx vendor string: SGI
server glx version string: 1.4
OpenGL vendor string: VMware, Inc.
OpenGL renderer string: Gallium 0.4 on llvmpipe (LLVM 3.7, 128 bits)
OpenGL version string: 3.0 Mesa 11.0.3
OpenGL shading language version string: 1.30

Xorg log:

[ 13.255] (WW) Open ACPI failed (/var/run/acpid.socket) (No such file or directory)
[ 13.533] (II) Loading /usr/lib/xorg/modules/extensions/libglx.so
[ 13.962] (II) Loading /usr/lib/xorg/modules/drivers/vmware_drv.so
[ 14.948] (II) Loading /usr/lib/xorg/modules/drivers/modesetting_drv.so
[ 14.978] (II) Loading /usr/lib/xorg/modules/drivers/fbdev_drv.so
[ 15.010] (II) Loading /usr/lib/xorg/modules/drivers/vesa_drv.so
[ 15.060] (II) Loading /usr/lib/xorg/modules/libfbdevhw.so
[ 15.270] (II) Loading /usr/lib/xorg/modules/libvgahw.so
[ 15.281] (==) vmware(0): Using HW cursor
[ 15.282] (II) Loading /usr/lib/xorg/modules/libfb.so
[ 15.367] (II) Loading /usr/lib/xorg/modules/libshadowfb.so

[ 13.962] (II) Loading /usr/lib/xorg/modules/drivers/vmware_drv.so
[ 14.948] (II) Loading /usr/lib/xorg/modules/drivers/modesetting_drv.so
[ 14.978] (II) Loading /usr/lib/xorg/modules/drivers/fbdev_drv.so
[ 15.010] (II) Loading /usr/lib/xorg/modules/drivers/vesa_drv.so
[ 15.053] (II) vmware: driver for VMware SVGA: vmware0405, vmware0710
[ 15.053] (II) FBDEV: driver for framebuffer: fbdev
[ 15.053] (II) VESA: driver for VESA chipsets: vesa

The Mesa driver is swrast_dri.so.

The backtrace is still the same:

Starting program: /usr/bin/glxgears
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0xb450eb40 (LWP 839)]
[New Thread 0xb3d0db40 (LWP 840)]

Program received signal SIGILL, Illegal instruction.
0xb7fd2091 in ?? ()

Thread 3 (Thread 0xb3d0db40 (LWP 840)):
#​0 0xb7fdbbc8 in __kernel_vsyscall ()
No symbol table info available.
#​1 0xb7a7da2b in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
No symbol table info available.
#​2 0xb7c8de4d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libc.so.6
No symbol table info available.
#​3 0xb757940a in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#​4 0xb7579275 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#​5 0xb7a78315 in start_thread () from /usr/lib/libpthread.so.0
No symbol table info available.
#​6 0xb7c80e1e in clone () from /usr/lib/libc.so.6
No symbol table info available.

Thread 2 (Thread 0xb450eb40 (LWP 839)):
#​0 0xb7fdbbc8 in __kernel_vsyscall ()
No symbol table info available.
#​1 0xb7a7da2b in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
No symbol table info available.
#​2 0xb7c8de4d in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libc.so.6
No symbol table info available.
#​3 0xb757940a in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#​4 0xb7579275 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#​5 0xb7a78315 in start_thread () from /usr/lib/libpthread.so.0
No symbol table info available.
#​6 0xb7c80e1e in clone () from /usr/lib/libc.so.6
No symbol table info available.

Thread 1 (Thread 0xb7a5f700 (LWP 838)):
#​0 0xb7fd2091 in ?? ()
No symbol table info available.
#​1 0xb7367986 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#​2 0xb7367d36 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#​3 0xb729bc19 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#​4 0xb72942e3 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#​5 0xb72948c6 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#​6 0xb7577813 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#​7 0xb72820fd in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#​8 0xb713d166 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#​9 0xb7125f5a in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#​10 0xb6ff0600 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#​11 0xb7004b40 in ?? () from /usr/lib/xorg/modules/dri/swrast_dri.so
No symbol table info available.
#​12 0x08049fdb in ?? ()
No symbol table info available.
#​13 0x080496ca in ?? ()
No symbol table info available.
#​14 0xb7baf497 in __libc_start_main () from /usr/lib/libc.so.6
No symbol table info available.
#​15 0x08049d0a in ?? ()
No symbol table info available.

This could be a bug in LLVM 3.7.0 if it generates faulty code for the Intel Pentium Dual-Core E6800.

If Mesa 11.0.3 is linked against the LLVM 3.6.2 libraries instead, the bug does not occur.

@llvmbot
Collaborator Author

llvmbot commented Oct 13, 2015

Another discovery:

In QEMU I can choose the CPU type (pentium, pentium2, core2duo, SandyBridge, and many more); the list of CPUs is shown by the command "qemu-i386 -cpu ?".

Until now I used the QEMU option "-cpu host", which means the host CPU (my Pentium Dual-Core E6800) is the one emulated.

Then I decided to set a different CPU name in my QEMU script:

-cpu core2duo -enable-kvm -machine type=pc,accel=kvm -smp 2

With this setting the bug disappears and everything works in my virtual machine: glxgears and all other OpenGL programs run without crashing, and the Mesa llvmpipe driver doesn't crash.

After that I set yet another CPU in QEMU:

-cpu Penryn -enable-kvm -machine type=pc,accel=kvm -smp 2 \

With the "Penryn" CPU the bug is back in my virtual machine, which means the bug seems related to the CPU type: the LLVM 3.7.0 library may have a bug in code generation that makes it fail on some CPUs.

This problem doesn't exist with the LLVM 3.6.2 library.

@llvmbot
Collaborator Author

llvmbot commented Oct 13, 2015

I did further tests.

I found that LLVM 3.7.0 identifies my Pentium Dual-Core as "penryn":

$ llc --version | grep CPU
Host CPU: penryn

But LLVM 3.6.2 (which doesn't have the bug) identifies my Pentium Dual-Core as "core2":

$ llc --version | grep CPU
Host CPU: core2

Why the different behaviour?

It would explain this bug if LLVM 3.7.0 generates binary code for the wrong CPU (penryn instead of core2).
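One way such a mismatch can arise, sketched under the assumption that the CPU name is derived from the CPUID family/model signature while SSE4.1 is reported by a separate CPUID feature bit. The helper names below are hypothetical and nothing here is LLVM code; the decoding follows the usual x86 rules for CPUID leaf 1, and the value 0x1067A is used only as an example of a signature that decodes to family 6, model 23.

```cpp
#include <cassert>
#include <cstdint>

// Family/model pair decoded from EAX of CPUID leaf 1.
struct CpuSignature {
  unsigned family;
  unsigned model;
};

// Decode the display family and model from the raw EAX value of CPUID leaf 1.
inline CpuSignature decodeSignature(std::uint32_t eax) {
  unsigned family = (eax >> 8) & 0xF;
  unsigned model = (eax >> 4) & 0xF;
  if (family == 6 || family == 15) {
    // The extended model field only applies to families 6 and 15.
    model += ((eax >> 16) & 0xF) << 4;
  }
  if (family == 15) {
    // The extended family field only applies to family 15.
    family += (eax >> 20) & 0xFF;
  }
  return {family, model};
}

// SSE4.1 support is reported separately, in bit 19 of ECX from CPUID leaf 1.
inline bool hasSSE41(std::uint32_t ecx) { return ((ecx >> 19) & 1u) != 0; }

// Example: the signature alone says nothing about SSE4.1.
inline void exampleDecode() {
  const CpuSignature sig = decodeSignature(0x1067A);
  assert(sig.family == 6 && sig.model == 23);
  assert(!hasSSE41(0)); // an ECX value with bit 19 clear means no SSE4.1
}
```

Two chips can therefore share family 6, model 23 and still differ in the SSE4.1 bit, so a name chosen from the signature alone cannot distinguish them.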

@llvmbot
Collaborator Author

llvmbot commented Oct 14, 2015

The git commit that introduced this bug is:

cd83d5b

llvm-mirror/llvm@cd83d5b

The problem is that LLVM 3.7.0 treats my Pentium Dual-Core as a "penryn".

Penryn supports SSE4.1, but the Pentium Dual-Core series (CPU family 6, model 23) does not.

The faulty commit deleted a check for SSE4.1:

return HasSSE41 ? "penryn" : "core2";

The solution is simply to restore this check for CPU family 6, model 23. Here is the patch that fixes the bug:

--- a/lib/Support/Host.cpp 2015-10-14 07:13:52.381374679 +0200
+++ b/lib/Support/Host.cpp 2015-10-14 07:13:28.224708323 +0200
@@ -332,6 +332,8 @@
              // 17h. All processors are manufactured using the 45 nm process.
              //
              // 45nm: Penryn , Wolfdale, Yorkfield (XE)
+      // Not all Penryn processors support SSE 4.1 (such as the Pentium brand)
+      return HasSSE41 ? "penryn" : "core2";
     case 29: // Intel Xeon processor MP. All processors are manufactured using
              // the 45 nm process.
       return "penryn";

@llvmbot
Collaborator Author

llvmbot commented Oct 14, 2015

add a test about SSE4 for CPU family 6 model 23
This patch adds an SSE4.1 check for CPU family 6, model 23.

It fixes the bug: Pentium Dual-Core CPUs don't have the SSE4 extension, so they need to be treated as "core2" and not "penryn".
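For readers without the LLVM tree at hand, the decision the patch restores can be sketched in isolation. This is a simplified stand-in, not the code in lib/Support/Host.cpp: the function name is hypothetical, only the two models mentioned in the patch are covered, and the "generic" fallback is a placeholder.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: pick a CPU name for an Intel family 6 part from its
// model number and whether SSE4.1 was detected on the host.
inline std::string pickIntelFamily6Name(unsigned model, bool hasSSE41) {
  switch (model) {
  case 23: // 45 nm parts; some Pentium-branded ones lack SSE4.1
    return hasSSE41 ? "penryn" : "core2";
  case 29: // 45 nm Xeon MP
    return "penryn";
  default:
    return "generic"; // placeholder for every model this sketch ignores
  }
}

// Example: the same model number maps to two names depending on SSE4.1.
inline void examplePickName() {
  assert(pickIntelFamily6Name(23, false) == "core2");
  assert(pickIntelFamily6Name(23, true) == "penryn");
}
```

With the check in place, a model 23 chip without SSE4.1 falls back to "core2", which is what LLVM 3.6.2 reported for this CPU.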

@foutrelis
Mannequin

foutrelis mannequin commented Oct 14, 2015

@​Craig Topper: Was the removal of the HasSSE41 conditional intentional? Asking because the commit message only mentions AVX.

@topperc
Collaborator

topperc commented Oct 22, 2015

Yes, the removal was intentional. If you're autodetecting the CPU name you should autodetect CPU features as well, using getHostCPUFeatures. The AVX problem was clearly worse because we were downgrading Haswell processors without AVX all the way down to Nehalem. This removed not only AVX support, but also BMI, LZCNT, RDRND, etc., and it reverted the scheduling model back to Nehalem as well. The problem would have persisted going forward, as there will probably always be CPUs that don't support AVX and we couldn't just keep calling them all Nehalem.

The case with SSE41 and penryn is not as severely limiting, but I wanted to clean up all such behavior.

I'm assuming if you run Mesa on a Sandybridge or Haswell that doesn't have AVX you will get other failures because we don't change the CPU name to "corei7".

Can Mesa use the getHostCPUFeatures function to fix this completely?

@llvmbot
Collaborator Author

llvmbot commented Oct 22, 2015

> Can Mesa use the getHostCPUFeatures function to fix this completely?

For now the Mesa developers don't use the "getHostCPUFeatures" function; they use their own feature detection:

https://bugs.freedesktop.org/show_bug.cgi?id=92214#c32

The problem is that the Mesa developers don't really tell the LLVM compiler which CPU features to use; they assume LLVM will automatically choose the right ones.

As I said, I think the problem is that my Pentium Dual-Core (CPU family 6, model 23) is treated as "penryn" by LLVM 3.7.0, and it seems that LLVM uses the SSE4 extension by default when the CPU name is "penryn" and no CPU feature arguments have been passed to it.

The problem disappears when I patch lib/Support/Host.cpp to change how LLVM detects my Pentium Dual-Core ("core2" instead of "penryn").

I don't know whether the Mesa developers can solve the problem easily.

@topperc
Collaborator

topperc commented Oct 22, 2015

Can you try modifying the Mesa patch from Jose to push "-sse4.1" if sse41 is detected as not supported?

@llvmbot
Collaborator Author

llvmbot commented Oct 22, 2015

Excellent advice, Craig, it works!

It works after adding this to Jose's patch:

+   if (!util_cpu_caps.has_sse4_1) {
+#if HAVE_LLVM >= 0x0304
+      MAttrs.push_back("-sse4.1");
+#else
+      MAttrs.push_back("-sse41");
+#endif
+   }

So the logic is now to remove the unsupported CPU features (like SSE4) from the list of CPU feature arguments, instead of adding the supported ones?

This logic is not really natural for a developer who wants to use the LLVM libraries. Such a developer would expect LLVM never to use an unsupported CPU feature as long as only supported features are passed to the compiler.

That's why I think the main problem is treating the Pentium Dual-Core as penryn: Penryn CPUs have SSE4, but Pentium Dual-Core CPUs do not, and it seems that LLVM will use SSE4.1 on its own even if the developer didn't explicitly add "+sse4.1" in the source code.

In my view LLVM should be stricter and more rigorous when it associates a CPU with a CPU name.

A CPU name should reflect the CPU's features exactly.

Maybe you should create a new CPU name for CPU family 6, model 23 ("dualcore") in order to avoid this SSE4 problem?
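The subtractive approach used in the Mesa patch can be sketched outside of Mesa as follows. This is only an illustration of the logic: the function names are hypothetical, the version argument mimics Mesa's HAVE_LLVM macro (0x0304 meaning LLVM 3.4), and the attribute spellings are taken from the patch in this comment.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Spelling of the "disable SSE4.1" attribute, as in the patch:
// "-sse4.1" when the version is 0x0304 or newer, "-sse41" before that.
inline std::string disableSSE41Attr(unsigned haveLLVM) {
  return haveLLVM >= 0x0304 ? "-sse4.1" : "-sse41";
}

// Return the attribute list with the negative attribute appended, but only
// when the host was detected as lacking SSE4.1.
inline std::vector<std::string>
withSSE41Override(std::vector<std::string> mattrs, bool hostHasSSE41,
                  unsigned haveLLVM) {
  if (!hostHasSSE41)
    mattrs.push_back(disableSSE41Attr(haveLLVM));
  return mattrs;
}

// Example: a host without SSE4.1 gets exactly one override.
inline void exampleOverride() {
  const std::vector<std::string> attrs = withSSE41Override({}, false, 0x0307);
  assert(attrs.size() == 1 && attrs[0] == "-sse4.1");
}
```

The list stays untouched on hosts that do have SSE4.1, so the override only matters when the detected CPU name implies more than the chip provides.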

@llvmbot
Collaborator Author

llvmbot commented Oct 23, 2015

I must say I find LLVM's behavior quite crazy here. Is this really expected? We never tell it the CPU name; LLVM figures that out all by itself. If it wants to use the scheduling model of a given CPU generation even though the CPU doesn't actually support all the features of that generation, that's fine, but implicitly assuming that all features of that generation are available doesn't make much sense, imho.
Having to explicitly tell LLVM it can't use these features even though the CPU detection was all done by LLVM itself (well, we do set the CPU name, but it's just an llvm::sys::getHostCPUName() followed by builder.setMCPU(MCPU)) seems very counterintuitive and just plain wrong (we might not even know about all the features that exist and that LLVM might try to use by mistake!).
Would the getHostCPUFeatures() function actually contain all the "negative" attributes, so that it could be used to set mattrs?

@topperc
Collaborator

topperc commented Oct 23, 2015

I'll agree the interface isn't ideal. Ideally we'd have a function that returned the CPU name AND the feature list. But since we already had the getHostCPUName function and a separate, but empty and unused getHostCPUFeatures function at the time I fixed the AVX bug I left them separate.

As I've said earlier, the CPU name alone is insufficient due to the way Intel chooses to enable and disable features within a given CPU family. We would need to add a separate CPU name for every possible combination of features Intel may choose to ship within a given family.

The behavior of setCPU without providing an explicit feature list is consistent with what you would get if you passed -march=penryn to gcc or clang. Both would enable sse4.1 instructions.

At one point in the past setCPU took "native" (or maybe empty) and did all the autodetection correctly. But all of that code was moved to Host.cpp and drivers were made responsible for calling the detection code before calling setCPU. This unfortunately created the AVX problem.

As of right now getHostCPUFeatures will detect every feature the x86 backend is aware of and its output can be used to set mattrs correctly. llc and clang are both doing this if -march=native is passed on their command lines.
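A minimal sketch of that idea, assuming the detected features are available as a name-to-bool map. The map contents and the helper name toMAttrs are made up for illustration, and the code does not call into LLVM; it only shows how a complete feature map yields both positive and negative attributes.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Turn a feature map (name -> supported?) into explicit attribute strings:
// "+name" for supported features and "-name" for unsupported ones.
inline std::vector<std::string>
toMAttrs(const std::map<std::string, bool> &features) {
  std::vector<std::string> mattrs;
  for (const auto &entry : features)
    mattrs.push_back((entry.second ? "+" : "-") + entry.first);
  return mattrs;
}

// Example with a made-up map for a CPU that has SSSE3 but not SSE4.1.
inline void exampleToMAttrs() {
  const std::map<std::string, bool> features = {{"sse4.1", false},
                                                {"ssse3", true}};
  const std::vector<std::string> mattrs = toMAttrs(features);
  assert(mattrs.size() == 2);
  assert(mattrs[0] == "-sse4.1"); // std::map iterates in key order
  assert(mattrs[1] == "+ssse3");
}
```

Because every known feature appears with an explicit sign, nothing is left to be inferred from the CPU name.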

@llvmbot
Collaborator Author

llvmbot commented Oct 23, 2015

I notice that gcc 5.2.0 sees my Pentium Dual-Core as "core2" (not "penryn") when I use "-march=native":

$ gcc -march=native -Q --help=target | grep march
-march= core2

More interesting still: in 2013 someone already opened a bug report here in LLVM's Bugzilla about the same problem (his Pentium Dual-Core was treated as penryn instead of core2 by LLVM, which triggered a bug with SSE4.1):

https://llvm.org/bugs/show_bug.cgi?id=16721

Benjamin Kramer fixed the problem by adding the "SSE4.1 test" in lib/Support/Host.cpp:

http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20130729/182469.html

@llvmbot
Collaborator Author

llvmbot commented Oct 23, 2015

> As of right now getHostCPUFeatures will detect every feature the x86 backend
> is aware of and its output can be used to set mattrs correctly. llc and
> clang are both doing this if -march=native is passed on their command lines.
Hmm ok, seems reasonable enough if getHostCPUFeatures is defined to work that way.

@topperc
Collaborator

topperc commented Oct 23, 2015

Turns out gcc doesn't have an -march="penryn". My mistake. So I guess gcc doesn't have a -march option that corresponds to a CPU with sse4.1, but not sse4.2.

@llvmbot
Collaborator Author

llvmbot commented Oct 23, 2015

> Turns out gcc doesn't have an -march="penryn". My mistake. So I guess gcc
> doesn't have a -march option that corresponds to a CPU with sse4.1, but
> not sse4.2.

Here are the SSE features gcc enables by default with -march=native (which resolves to "core2" on my configuration) on a Pentium Dual-Core CPU:

$ gcc -march=native -Q --help=target | grep sse
-mno-sse4 [enabled]
-msse [enabled]
-msse2 [enabled]
-msse2avx [disabled]
-msse3 [enabled]
-msse4 [disabled]
-msse4.1 [disabled]
-msse4.2 [disabled]
-msse4a [disabled]
-msse5
-msseregparm [disabled]
-mssse3 [enabled]

Someone made an interesting suggestion two years ago for solving this problem:

http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20130729/182588.html

The idea is to remove SSE4.1 from the Penryn definition in lib/Target/X86/X86.td, because ideally a CPU family name should reflect only the "common" features that ALL CPUs in the family share. That would solve all the problems related to SSE4.1 or AVX2 when a CPU doesn't support one of these extensions.

Belonging to a CPU family does not mean sharing all of its features, so the feature list in lib/Target/X86/X86.td should be restricted for problematic CPU families like penryn.

Applying this idea would give:

def : ProcessorModel<"penryn", SandyBridgeModel,
[FeatureSSSE3, FeatureCMPXCHG16B, FeatureSlowBTMem]>;

And if the developer wants to know the complete feature set of the host CPU, he can use the getHostCPUFeatures() function to set the mattrs vector correctly.
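To illustrate how the proposal would interact with explicit attributes, here is a rough model of feature resolution: start from the features implied by the CPU name, then apply "+feat" and "-feat" overrides in order. This is an assumption about the general behaviour discussed in this thread, not LLVM's implementation; the function name and the two baseline sets are invented, with lowercase feature names standing in for the X86.td entries.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

using FeatureSet = std::set<std::string>;

// Apply "+feat" / "-feat" overrides, in order, on top of the features implied
// by the CPU name. Entries without a leading '+' or '-' are ignored.
inline FeatureSet resolveFeatures(FeatureSet enabled,
                                  const std::vector<std::string> &mattrs) {
  for (const std::string &attr : mattrs) {
    if (attr.size() < 2)
      continue;
    const std::string name = attr.substr(1);
    if (attr[0] == '+')
      enabled.insert(name);
    else if (attr[0] == '-')
      enabled.erase(name);
  }
  return enabled;
}

// Example baselines, invented for illustration.
inline void exampleResolve() {
  const FeatureSet current = {"ssse3", "sse4.1"}; // name implies SSE4.1
  const FeatureSet proposed = {"ssse3"};          // common subset only
  // Today: SSE4.1 is on unless it is explicitly removed.
  assert(resolveFeatures(current, {}).count("sse4.1") == 1);
  assert(resolveFeatures(current, {"-sse4.1"}).count("sse4.1") == 0);
  // Proposal: SSE4.1 is off unless host detection explicitly adds it.
  assert(resolveFeatures(proposed, {}).count("sse4.1") == 0);
  assert(resolveFeatures(proposed, {"+sse4.1"}).count("sse4.1") == 1);
}
```

Under the reduced baseline, forgetting to pass attributes costs performance on capable CPUs rather than producing illegal instructions on less capable ones.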

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021