LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit LLVM bugs.

Bug 52599 - lld-link /delayload - first call of a function with bad floating point parameter on x64
Status: NEW
Alias: None
Product: lld
Classification: Unclassified
Component: COFF
Version: 13.0
Hardware: PC Windows NT
Importance: P normal
Assignee: Unassigned LLVM Bugs
Reported: 2021-11-24 08:49 PST by Thomas Ferrand
Modified: 2021-11-24 08:49 PST

Attachments
Source files to reproduce the bug (500 bytes, application/x-zip-compressed)
2021-11-24 08:49 PST, Thomas Ferrand

Description Thomas Ferrand 2021-11-24 08:49:48 PST
Created attachment 25475
Source files to reproduce the bug

When a program is linked against a DLL with the /delayload switch, the first call to a function defined in the DLL receives a bad value for at least one of its floating-point parameters.

Attached are two source files, my_lib.cpp and my_exe.cpp, that reproduce the bug. Build them as follows:
 - "C:\Program Files\LLVM\bin\clang-cl.exe" my_lib.cpp /link /DLL /OUT:my_dll.dll
 - "C:\Program Files\LLVM\bin\clang-cl.exe" /c my_exe.cpp /Fomy_exe.obj
 - "C:\Program Files\LLVM\bin\lld-link.exe" my_dll.lib Delayimp.lib /delayload:my_dll.dll my_exe.obj /OUT:my_exe.exe

Running my_exe.exe prints "1 0 3" instead of the expected "1 2 3".

The last step can be replaced with
"C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.28.29910\bin\Hostx64\x64\link.exe" my_dll.lib Delayimp.lib /delayload:my_dll.dll my_exe.obj /OUT:my_exe.exe
to use link.exe with the same options, or with
"C:\Program Files\LLVM\bin\lld-link.exe" my_dll.lib my_exe.obj /OUT:my_exe.exe
to use lld without /delayload. In both cases the resulting executable prints the expected "1 2 3".

I believe the bug occurs because __delayLoadHelper2 (the function defined in delayimp.lib that loads the DLL and locates the requested function on first use) writes into the stack space at the top of its caller's frame (I don't know why; is this a Windows calling-convention rule?), but the thunk generated by lld does not reserve that space.
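
The behaviour described above is most likely the shadow space (also called home space) of the Microsoft x64 calling convention: before every call, the caller must reserve 32 bytes directly above the return address, and the callee may use them, typically to spill rcx, rdx, r8 and r9, without preserving whatever was there. The sketch below models only that offset rule as compile-time checks; `calleeToCallerOffset` and `overlapsShadowSpace` are hypothetical helper names, and offsets are plain byte counts relative to rsp rather than real addresses.

```cpp
#include <cstdint>

// Hypothetical model of the Windows x64 shadow-space rule.
// Just before a `call`, the caller has reserved [rsp, rsp + 0x20).
// The `call` pushes an 8-byte return address, so inside the callee the
// same 32 bytes are addressed as [rsp + 0x08, rsp + 0x28) at entry.
constexpr std::int64_t kShadowSpaceBytes = 0x20;
constexpr std::int64_t kReturnAddressBytes = 8;

// Translate an offset used by the callee at its entry point (such as the
// "[rsp+10h]" in a prologue) into an offset relative to the caller's rsp
// at the moment of the call.
constexpr std::int64_t calleeToCallerOffset(std::int64_t calleeOffset) {
    return calleeOffset - kReturnAddressBytes;
}

// True if the byte range [offset, offset + size), relative to the caller's
// rsp at the call, overlaps the 32 bytes the callee is allowed to overwrite.
constexpr bool overlapsShadowSpace(std::int64_t offset, std::int64_t size) {
    return offset < kShadowSpaceBytes && offset + size > 0;
}

// The callee's home slot for rcx, [rsp+8] at entry, is the caller's [rsp].
static_assert(calleeToCallerOffset(0x08) == 0x00, "rcx home slot");
// [rsp+28h] at entry is the first byte past the shadow space.
static_assert(!overlapsShadowSpace(calleeToCallerOffset(0x28), 8),
              "just past the shadow space");
```

An offset seen inside the callee is first translated with `calleeToCallerOffset` and then checked against the 32-byte window; anything the caller saved inside that window can be overwritten by the callee.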

Specifically, the thunk generated by lld (for x64) looks like this:
push        rcx  
push        rdx  
push        r8  
push        r9  
sub         rsp,48h  
movdqa      xmmword ptr [rsp],xmm0  
movdqa      xmmword ptr [rsp+10h],xmm1  
movdqa      xmmword ptr [rsp+20h],xmm2  
movdqa      xmmword ptr [rsp+30h],xmm3  
mov         rdx,rax  
lea         rcx,[__xt_z+28h (01401C9E88h)]  
call        __delayLoadHelper2 (01401A3464h)  
movdqa      xmm0,xmmword ptr [rsp]  
movdqa      xmm1,xmmword ptr [rsp+10h]  
movdqa      xmm2,xmmword ptr [rsp+20h]  
movdqa      xmm3,xmmword ptr [rsp+30h]  
add         rsp,48h  
pop         r9  
pop         r8  
pop         rdx  
pop         rcx  
jmp         rax

(It pushes the integer argument registers, allocates 48h more bytes of stack, saves xmm0 to xmm3 starting at [rsp] before calling __delayLoadHelper2, and restores everything afterwards.)
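
As a sketch of why this layout is exposed, the frame arithmetic of the lld listing can be written out as compile-time checks. Assumptions: offsets are relative to rsp at the `call __delayLoadHelper2` instruction, and the thunk is entered by a jump, so rsp is 8 modulo 16 on entry (the call into the imported function has just pushed its return address). All constant and function names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical model of the lld delay-load thunk frame listed above.
// Offsets are bytes relative to rsp at the `call __delayLoadHelper2`.
constexpr std::int64_t kPushBytes = 4 * 8;        // push rcx / rdx / r8 / r9
constexpr std::int64_t kSubRsp = 0x48;            // sub rsp,48h
constexpr std::int64_t kFrameBytes = kPushBytes + kSubRsp;
constexpr std::int64_t kShadowSpaceBytes = 0x20;  // Windows x64 home space

// movdqa xmmword ptr [rsp+N],xmmI for I = 0..3
constexpr std::int64_t kXmmSlotOffset[4] = {0x00, 0x10, 0x20, 0x30};
constexpr std::int64_t kXmmSlotBytes = 16;

// Assumption: rsp == 8 (mod 16) on entry to the thunk.
constexpr std::int64_t kEntryRspMod16 = 8;

// movdqa requires 16-byte alignment, so the call-site rsp must be 0 mod 16.
static_assert((kEntryRspMod16 - kFrameBytes) % 16 == 0,
              "XMM save slots are 16-byte aligned");

// All four XMM slots fit inside the sub rsp,48h allocation.
static_assert(kXmmSlotOffset[3] + kXmmSlotBytes <= kSubRsp,
              "save area fits in the allocation");

// True if xmmI was saved inside the 32 bytes the callee may overwrite.
constexpr bool savedInShadowSpace(int i) {
    return kXmmSlotOffset[i] < kShadowSpaceBytes;
}
```

The checks suggest the frame itself is well formed (aligned for `movdqa`, large enough for all four registers); the only problem is that the slots for xmm0 and xmm1 sit inside the callee's 32-byte window.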

By contrast, the thunk generated by link.exe looks like this:
mov         qword ptr [rsp+8],rcx  
mov         qword ptr [rsp+10h],rdx  
mov         qword ptr [rsp+18h],r8  
mov         qword ptr [rsp+20h],r9  
sub         rsp,68h  
movdqa      xmmword ptr [rsp+20h],xmm0  
movdqa      xmmword ptr [rsp+30h],xmm1  
movdqa      xmmword ptr [rsp+40h],xmm2  
movdqa      xmmword ptr [rsp+50h],xmm3  
mov         rdx,rax  
lea         rcx,[__DELAY_IMPORT_DESCRIPTOR_my_dll (0140435020h)]  
call        __delayLoadHelper2 (01400089C2h)  
movdqa      xmm0,xmmword ptr [rsp+20h]  
movdqa      xmm1,xmmword ptr [rsp+30h]  
movdqa      xmm2,xmmword ptr [rsp+40h]  
movdqa      xmm3,xmmword ptr [rsp+50h]  
mov         rcx,qword ptr [rsp+70h]  
mov         rdx,qword ptr [rsp+78h]  
mov         r8,qword ptr [rsp+80h]  
mov         r9,qword ptr [rsp+88h]  
add         rsp,68h  
jmp         __tailMerge_my_dll+77h (01402237B8h)  
jmp         rax

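It looks very similar, but unlike lld's thunk it does not save the XMM registers at the very top of the stack: they start at [rsp+20h], leaving the 32 bytes from [rsp] to [rsp+1Fh] for __delayLoadHelper2 to use freely. (The integer registers are stored in the 32 bytes just above the return address instead of being pushed.)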
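
The same kind of arithmetic applied to the link.exe listing, again as a hypothetical compile-time sketch, suggests why its layout is safe: the XMM slots begin exactly where the 32-byte window ends, and the registers reloaded from [rsp+70h] to [rsp+88h] after the call are the same slots written at [rsp+8] to [rsp+20h] before `sub rsp,68h`. As before, rsp is assumed to be 8 modulo 16 on entry, and all names are illustrative.

```cpp
#include <cstdint>

// Hypothetical model of the link.exe delay-load thunk listed above.
constexpr std::int64_t kSubRsp = 0x68;            // sub rsp,68h (no pushes)
constexpr std::int64_t kShadowSpaceBytes = 0x20;  // Windows x64 home space
constexpr std::int64_t kXmmSlotBytes = 16;

// mov qword ptr [rsp+N],reg, executed before sub rsp,68h (entry-relative).
constexpr std::int64_t kStoreAtEntry[4] = {0x08, 0x10, 0x18, 0x20};
// mov reg,qword ptr [rsp+N], executed after the call (call-site-relative).
constexpr std::int64_t kReloadAtCall[4] = {0x70, 0x78, 0x80, 0x88};
// movdqa xmmword ptr [rsp+N],xmmI (call-site-relative).
constexpr std::int64_t kXmmSlotOffset[4] = {0x20, 0x30, 0x40, 0x50};

// Lowering rsp by kSubRsp shifts every entry-relative offset up by kSubRsp.
constexpr bool reloadMatchesStore(int i) {
    return kStoreAtEntry[i] + kSubRsp == kReloadAtCall[i];
}

// The save slot for xmmI lies entirely above the callee's 32-byte window.
constexpr bool xmmSlotOutsideShadowSpace(int i) {
    return kXmmSlotOffset[i] >= kShadowSpaceBytes;
}

// Assumption: rsp == 8 (mod 16) on entry, so sub rsp,68h leaves it aligned.
static_assert((8 - kSubRsp) % 16 == 0, "XMM save slots are 16-byte aligned");
static_assert(kXmmSlotOffset[3] + kXmmSlotBytes <= kSubRsp,
              "save area fits in the sub rsp,68h allocation");
```

Both thunks allocate a 68h-byte frame below the entry rsp; the difference in this model is only where the first XMM slot starts (0 for lld, 20h for link.exe).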

Indeed, at least on my machine, the first two instructions of __delayLoadHelper2 are:
mov         qword ptr [rsp+10h],rbx  
mov         qword ptr [rsp+18h],rsi  

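which, if I'm not mistaken, write into the stack slots where xmm0 and xmm1 were saved: the call pushed an 8-byte return address, so the helper's [rsp+10h] is the thunk's [rsp+8] (the upper half of the saved xmm0) and its [rsp+18h] is the thunk's [rsp+10h] (the lower half of the saved xmm1, which holds the second floating-point argument).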
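
To make the effect on the arguments concrete, here is a small hypothetical model of lld's 40h-byte XMM save area as eight qwords. The argument tags stand in for the bit patterns of the scalar floating-point arguments, which live in the low half of each XMM register; the values written for rbx and rsi are arbitrary placeholders, since their real contents at that point are unknown. The names `SaveArea`, `saveXmm`, `runHelperPrologue` and `restoredLow` are illustrative only, and the code assumes C++14 or later.

```cpp
#include <cstdint>

// Hypothetical model of lld's XMM save area as eight qwords, indexed by
// (call-site rsp offset) / 8. xmmI was saved at [rsp + 10h*I], so its low
// qword (where a scalar float/double argument lives) is qword 2*I and its
// high qword is qword 2*I + 1.
struct SaveArea {
    std::uint64_t qword[8];
};

// Save four low halves; the high halves are zeroed for simplicity.
constexpr SaveArea saveXmm(std::uint64_t x0, std::uint64_t x1,
                           std::uint64_t x2, std::uint64_t x3) {
    return SaveArea{{x0, 0, x1, 0, x2, 0, x3, 0}};
}

// The helper's entry rsp is the call-site rsp minus 8 (the return address).
constexpr std::int64_t toQwordIndex(std::int64_t calleeOffset) {
    return (calleeOffset - 8) / 8;
}

// The first two instructions of __delayLoadHelper2, as reported above:
//   mov qword ptr [rsp+10h],rbx  -> qword 1 (high half of saved xmm0)
//   mov qword ptr [rsp+18h],rsi  -> qword 2 (low half of saved xmm1)
constexpr SaveArea runHelperPrologue(SaveArea area, std::uint64_t rbx,
                                     std::uint64_t rsi) {
    area.qword[toQwordIndex(0x10)] = rbx;
    area.qword[toQwordIndex(0x18)] = rsi;
    return area;
}

// Low qword of xmmI after `movdqa xmmI, xmmword ptr [rsp+10h*I]`.
constexpr std::uint64_t restoredLow(SaveArea area, int i) {
    return area.qword[2 * i];
}
```

For example, after `runHelperPrologue(saveXmm(0x11, 0x22, 0x33, 0x44), 0xAA, 0xBB)`, `restoredLow` returns 0x11, 0xBB, 0x33 and 0x44 for xmm0 to xmm3: only the second argument changes, which is consistent with the "1 0 3" output.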