Created attachment 25475
Source files to reproduce the bug

When a program is linked against a DLL using the /delayload switch, the first call to a function defined in the DLL receives a bad value for at least one of its floating-point parameters.

Attached are two source files, my_lib.cpp and my_exe.cpp, that reproduce the bug. They should be built as follows:

- "C:\Program Files\LLVM\bin\clang-cl.exe" my_lib.cpp /link /DLL /OUT:my_dll.dll
- "C:\Program Files\LLVM\bin\clang-cl.exe" /c my_exe.cpp /OUT:my_exe.obj
- "C:\Program Files\LLVM\bin\lld-link.exe" my_dll.lib Delayimp.lib /delayload:my_dll.dll my_exe.obj /OUT:my_exe.exe

When running my_exe.exe, the output is "1 0 3" instead of the expected "1 2 3".

The last step can be replaced with

    "C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.28.29910\bin\Hostx64\x64\link.exe" my_dll.lib Delayimp.lib /delayload:my_dll.dll my_exe.obj /OUT:my_exe.exe

to use link.exe with the same options, or with

    "C:\Program Files\LLVM\bin\lld-link.exe" my_dll.lib my_exe.obj /OUT:my_exe.exe

to use lld without /delayload. In both of those cases the resulting executable prints the expected "1 2 3".

I believe the bug occurs because __delayLoadHelper2 (the function defined in delayimp.lib that actually loads the DLL and locates the function we want to call on first use) writes into the top of its caller's stack space (I don't know why; is it a weird Windows calling convention?), but the thunk generated by lld doesn't reserve that space.

Specifically, the thunk generated by lld (for x64) looks like this:

    push        rcx
    push        rdx
    push        r8
    push        r9
    sub         rsp,48h
    movdqa      xmmword ptr [rsp],xmm0
    movdqa      xmmword ptr [rsp+10h],xmm1
    movdqa      xmmword ptr [rsp+20h],xmm2
    movdqa      xmmword ptr [rsp+30h],xmm3
    mov         rdx,rax
    lea         rcx,[__xt_z+28h (01401C9E88h)]
    call        __delayLoadHelper2 (01401A3464h)
    movdqa      xmm0,xmmword ptr [rsp]
    movdqa      xmm1,xmmword ptr [rsp+10h]
    movdqa      xmm2,xmmword ptr [rsp+20h]
    movdqa      xmm3,xmmword ptr [rsp+30h]
    add         rsp,48h
    pop         r9
    pop         r8
    pop         rdx
    pop         rcx
    jmp         rax

It allocates space on the stack, uses it to save the registers before calling __delayLoadHelper2, and restores them afterwards.

The thunk generated by link.exe, by contrast, looks like this:

    mov         qword ptr [rsp+8],rcx
    mov         qword ptr [rsp+10h],rdx
    mov         qword ptr [rsp+18h],r8
    mov         qword ptr [rsp+20h],r9
    sub         rsp,68h
    movdqa      xmmword ptr [rsp+20h],xmm0
    movdqa      xmmword ptr [rsp+30h],xmm1
    movdqa      xmmword ptr [rsp+40h],xmm2
    movdqa      xmmword ptr [rsp+50h],xmm3
    mov         rdx,rax
    lea         rcx,[__DELAY_IMPORT_DESCRIPTOR_my_dll (0140435020h)]
    call        __delayLoadHelper2 (01400089C2h)
    movdqa      xmm0,xmmword ptr [rsp+20h]
    movdqa      xmm1,xmmword ptr [rsp+30h]
    movdqa      xmm2,xmmword ptr [rsp+40h]
    movdqa      xmm3,xmmword ptr [rsp+50h]
    mov         rcx,qword ptr [rsp+70h]
    mov         rdx,qword ptr [rsp+78h]
    mov         r8,qword ptr [rsp+80h]
    mov         r9,qword ptr [rsp+88h]
    add         rsp,68h
    jmp         __tailMerge_my_dll+77h (01402237B8h)
    jmp         rax

It looks very similar but, for some reason, it doesn't save the xmmX registers at the top of the stack as lld does; it leaves 32 bytes there that __delayLoadHelper2 is free to overwrite.

Indeed (at least on my machine), the first two instructions of __delayLoadHelper2 are:

    mov         qword ptr [rsp+10h],rbx
    mov         qword ptr [rsp+18h],rsi

which, if I'm not mistaken, write into the stack space where xmm0 and xmm1 were saved.